Database system and methods

ABSTRACT

In general, in one aspect, the invention relates to a method for transaction processing. The method includes specifying metadata and storing the metadata. An index is created in response to the stored metadata. The method also includes receiving a transaction, generating an index log of changes to the index in response to the received transaction, and modifying the first index in response to the generated index log.

RELATED APPLICATIONS

This application is a continuation of U.S. patent application Ser. No.10/150,763, filed on May 17, 2002, the entire disclosure of which ishereby incorporated by reference.

FIELD OF THE INVENTION

The invention relates to computer-based data processing, moreparticularly, to database management systems.

BACKGROUND OF THE INVENTION

Data in database management systems are typically stored in the form ofrecords, or tuples, each composed of a fixed number of fields, alsoreferred to as attributes. The fields of a record contain the dataassociated with that record. Frequently, database records are presentedlogically in the form of a table, with records as the rows of the table,and attributes as the columns. Systems typically store records in memoryand/or on disk or other media as a linked list, with data for eachrecord stored together. However, the data for adjacent records or evenadjacent values of the same field are not necessarily stored in anyparticular proximity or order.

The manner in which the data is stored presents inherent limitations onthe performance of database systems. For example, typically, on-linetransaction processing (OLTP) is performed using one database system orcomputer, and on-line analytical processing (OLAP) is performed onanother computer. Data typically is offloaded from a transactionprocessing system to a data warehouse, and the data in the datawarehouse is used for analytical processing. There frequently aresignificant time delays associated with the transfer of data from onesystem to another. Analytical processing frequently takes a significantamount of time. Typically, the analytical processing on a data warehouseis performed on an entire table or set of tables, even when only a smallportion of the table(s) changed. This also can be very inefficient intime and resources.

SUMMARY OF THE INVENTION

In view of the foregoing, there is a need for systems and methods tostore and manipulate data so as to avoid the inefficiencies of the priorart. Embodiments of the present invention, for example, performtransaction processing and traditional analytical processing without useof a separate data warehouse, and further, are capable of providingreal-time analytical processing of transaction data.

In general, in one aspect, a declarative, vectorized, metadata-enabled,relational query language supports heterogeneous and homogeneous datasources. The language allows for specification of vertical andhorizontal vectorized functions, and uses a construct (for example, a“mapping” function) that can be used to simply specify complex rules andrelationships such as those specified in a directed acyclic graph. Usinganother language such as SQL, a database user would typically need tospecify query terms such as FROM, WHERE, GROUP BY, ORDER BY, and HAVING,to design such a query, but here the user only specifies relationshipsbetween table columns. User's query language code is processed andconverted into metadata that describes tables and relationships betweenthe tables. That metadata is then used throughout the system to enablefeatures that previously were not possible, as part of batch processing,transaction processing, OLAP and SQL query processing, internal andexternal checkpointing, and so on.

In general, in one aspect, the invention relates to a method fortransaction processing. The method includes specifying metadata andstoring the metadata. An index is created in response to the storedmetadata. The method also includes receiving a transaction, generatingan index log of changes to the index in response to the receivedtransaction, and modifying the first index in response to the generatedindex log.

The metadata can include a description of database elements andrelationships between the database elements. The metadata can alsoinclude a description of database elements required for a query. Themetadata can also include a functional description of the relationshipsbetween database elements, and can include a description of datadependencies between database elements.

The step of specifying metadata can be performed through use of a userinterface operated by a user. In one embodiment, the user interface is agraphical user interface that allows for graphical display andmodification of the metadata.

In one embodiment, the receiving step, the generating step, and themodifying step are performed by a transaction subsystem. The transactioncan be, for example, an insert, update (i.e. modify), or remove (i.e.delete) operation to a data table. Based on the metadata and the log,additional database changes required by the transaction may bedetermined.

In some embodiments, the step of modifying the first index is performedwhen the transactions are complete.

Embodiments of the method can be implemented by a database system fortransaction processing. The system includes metadata, a database forstoring the metadata, and a transaction processing subsystem or modulefor creating a first index in response to the stored metadata, receivinga transaction, generating an index log of changes to the first indexrequired by the received transaction, and modifying the first indexbased on the generated index log.

In one embodiment, the system includes a transaction manager formanaging the state of the transaction processing subsystem; an adaptermanager for receiving information on transaction sources from thetransaction manager; an adapter listener in communication with theadapter manager for receiving transaction data from a queued transactiondata source; a resource manager in communication with the adapterlistener for receiving data from an adapter listener and for loggingcollected data; a minimum recalculation engine for recalculatingportions of database tables in response to the logged collected data andthe metadata; an internal check point manager for committing loggedcollected data to a database table; and an external checkpoint managerfor publishing changes to target databases. In various embodiments therecan be one or more of each of the elements. For example, the system caninclude a second resource manager in communication with a second adapterlistener for receiving transaction data from a second queued transactiondata source and for logging collected data.

The minimum recalculation engine recalculates portions of databasetables in response to the logged collected data and the metadata.Typically, the minimum recalculation engine performs calculations onlyon data that has changed. In preferred embodiments, the minimumrecalculation engine performs table operations on indices that representresults of operations.

In general, in another aspect, the invention relates to a databasesystem. The database system includes a database storing data andmetadata describing database elements and relationships between thedatabase elements; a transaction processing subsystem for processingtransactions and updating the database; and a user interface forquerying the database and continuously providing updated query resultsas transactions are processed by the transaction processing subsystem.

There can be one or more of each of these elements; the system caninclude a second database storing data, and the transaction processingsystem can update the first database and the second database. The userinterface can include a web browser in communication with a web server,wherein the web server provides the results to the web browser fordisplay to a user.

In general, in another aspect, the invention relates to a databasesystem. The system includes a vectorized database storing table datafields linearly as a block of contiguous data; a database storingmetadata describing database elements and relationships between databaseelements; a user interface for querying the database and providingupdated query results; and a batch processing subsystem for processingbatch queries.

In some embodiments, the batch processing subsystem uses a memory modelselected from the set of vertical partitioning, horizontal partitioning,and a blend of vertical partitioning and horizontal partitioning. Insome embodiments, the system also includes a transaction processingsubsystem, such as that described above. In some embodiments, the systemalso includes a user interface as described above. The user interfacecan include a web server in communication with a web browser or applet.

In general, in another aspect, the invention relates to a method forexternal checkpointing. The method includes initially communicating adata table and a log comprising entries of data table transactions to asubscriber; and communicating additional log entries to the subscriberwhen they are received. The method includes determining that the numberof log entries is above a predetermined threshold, applying the logentries to the data table, and communicating the updated data table tothe subscriber. In sone embodiments, the subscriber is an OLAP server orother database system. Embodiments of the method can be implemented in acheckpointing subsystem that includes a transmitter for initiallycommunicating a data table and a log comprising entries of data tabletransactions to a subscriber; and for communicating additional logentries to the subscriber when they are received.

In general, in another aspect, the invention relates to a method forfault-recoverable, non-blocking checkpointing of table data. A firstcopy of a data table and a second copy of the data table are stored. Alog comprising entries of data table transactions is received. The logentries are applied to the first copy of the table. When the applicationof the log entries to the first copy of the table is complete, the logentries are applied to the second copy of the data table. The first copyof the table and the second copy of the table are swapped, and the logentries are applied to the second copy of the table.

In one embodiment, the first copy of the data table and the second copyof the data table are stored on disk. The swapping step then includesrenaming the first copy of the table and renaming the second copy of thetable. The copies can also be stored in some combination of memory andon disk. In some embodiments, the logs include entries such as inserts,edits, and deletes to the data table. In some embodiments, the step ofapplying the logs to the first copy of the table comprises modifying thecopy of the table in response to the log entries. The logs can bearchived once they are applied to the second copy of the table.

This method is also applicable to fault recovery. In such a case, thestep of receiving a log comprises receiving a log of data tabletransactions comprising entries already applied to the first copy of thetable before an interruption and entries not applied to the first tablebefore the interruption. The step of applying the log to the to thefirst copy of the table includes applying the unapplied log entries tothe first copy of the table, or the step of applying the log entries tothe second copy of the table includes applying unapplied log entries tothe second copy of the table, as appropriate.

Embodiments of such a method can be implemented in a checkpointingsubsystem configured for fault-recoverable, non-blocking checkpointingof table data. Such a subsystem can include a data store for storing afirst copy of a data table and a second copy of the data table, and areceiver for receiving logs of data table transactions. The subsystemcan also include a first updater for applying the logs to the first copyof the table; a swapper for swapping the first copy of the table and thesecond copy of the table; and a second updater for applying the logs tothe second copy of the table. The first and second update can be thesame or different updaters.

BRIEF DESCRIPTION OF THE DRAWINGS

The invention is identified with particularity in the claims. Theadvantages of the present invention may be better understood byreferring to the following description and the accompanying drawings, inwhich:

FIG. 1 illustrates an embodiment of a database system in accord with thepresent invention;

FIG. 2 depicts an embodiment of the database module 100 of FIG. 1;

FIG. 3 shows an embodiment of the user interface module 104 of FIG. 1;

FIG. 4 depicts an embodiment of the data generator module 108 of FIG. 1;

FIG. 5 shows an embodiment of the external interface module 124 of FIG.1;

FIG. 6 illustrates an embodiment of the OLAP services module 112 of FIG.1;

FIG. 7 depicts an embodiment of the batch processor module 120 of FIG.1;

FIG. 8 shows an embodiment of the transaction processor module 116 ofFIG. 1;

FIG. 9 is a flowchart illustrating an embodiment of a method forspecifying a data model and a set of computational rules accompanyingthe model in accord with the present invention;

FIG. 10 is a flowchart depicting an embodiment of a method forprocessing a g-table in accord with the present invention;

FIG. 11 is a flowchart showing an embodiment of a method for computingthe expressions underlying a combined relation specified by a user inaccord with the present invention;

FIG. 12 is a flowchart illustrating an embodiment of an iterativealgorithm for transforming source relations into an expression using thefuse, link, and loop operators;

FIGS. 13A-B show an example of the state information that is maintainedby transaction manager 802 of FIG. 8;

FIG. 14 is an example showing dependencies between tables;

FIGS. 15A-C show an example of the handling of transactions in a fuseoperation in accord with minimal recalculation methods of the presentinvention;

FIGS. 16A-B show an example of the handling of transactions in a linkoperation in accord with minimal recalculation methods of the presentinvention;

FIGS. 17A-C show an example of the handling of transactions in a loopoperation in accord with minimal recalculation methods of the presentinvention;

FIGS. 18A-B show an example of the handling of transactions in anaggregation operation in accord with minimal recalculation methods ofthe present invention;

FIG. 19 is a flowchart of a “process” function of a minimalrecalculation engine in accord with the present invention;

FIGS. 20A-B are a flowchart of a “recalc” function of a minimalrecalculation engine in accord with the present invention;

FIG. 21 is a flowchart of a “fuse” function of a minimal recalculationengine in accord with the present invention;

FIGS. 22A-B are a flowchart of a “link” function of a minimalrecalculation engine in accord with the present invention;

FIG. 23 is a flowchart of a “loop” function of a minimal recalculationengine in accord with the present invention;

FIG. 24 is a flowchart showing the application of field rules in aparallelized manner, in accord with the present invention; and

FIG. 25 shows fields of a table and backup fields of a table, used in acheckpointing operation in accord with the present invention.

In the drawings, like reference characters generally refer to the sameparts throughout the different views. The drawings are not necessarilyto scale, emphasis instead being placed upon illustrating the principlesof the invention.

DETAILED DESCRIPTION OF THE INVENTION

In brief overview, embodiments of the present invention provide adatabase system that is suitable for on-line transaction processing(OLTP) and traditional on-line analytical processing (OLAP). Furtherembodiments can provide real-time OLAP for instantaneous decisionsupport. The architecture of the database system enables updating ofOLAP requests in real-time with data received from newly-processedtransactions. This capability is realized, in part, through datavectorization and the efficient use of indexing to acceleratecalculations and database operations.

FIG. 1 presents an embodiment of a database system in accord with thepresent invention. The database system includes a database module 100 indirect or indirect communication with a user interface module 104. Thedatabase module 100 provides both traditional database and datawarehouse functionality. The database module 100 may also provide toolsfor viewing and editing metadata associated with the data contained inthe database module 100 or elsewhere. The user interface module 104serves as an access point for privileged and non-privileged users tointeract with the database system. Embodiments of the user interfacemodule 104 range from character-driven terminal systems to thin-clientgraphical user interface (GUI) systems, such as web browsers and otherclient software applications.

Depending on the functionality desired from the database system, thesystem may optionally include one or more of a data generator module108, an OLAP services module 112, a transaction processor module 116, abatch processor module 120, and an external interface module 124. Thedata generator module 108 is useful for system diagnostics. For example,the data generator module 108 may generate large, complex data sets totest the scalability of a database system and provide heuristic feedbackto the user enabling fine-grained tuning of the application. The OLAPservices module 112, transaction processor module 116, and batchprocessor module 120 interact with the database module 100 to providesupport for traditional OLAP, OLTP, and real-time OLAP.

The external interface module 124 defines a set of applicationprogramming interfaces (APIs) for third-parties to use when developingsoftware that interacts with the database system. The database systemmay also communicate with one or more external data sources 128,including both queued and non-queued data sources. Exemplary queued datasources include those provided by Tibco, MQ Series, and Tandem.Exemplary nonqueued data sources include those provided by DB2, Oracleand SQL Server, and flat data files.

The components of the database system may be implemented as individualsoftware processes in a single-processor or multi-processor computingenvironment. Particular components may be especially suited toimplementation on specialized hardware, such as a network ofworkstations (NOW), as discussed in greater detail below. It isunderstood that these components, when present, may be directly orindirectly connected in various configurations in accord with theteachings of the present invention. Accordingly, the interconnectionsillustrated in FIG. 1 are exemplary and are not to be viewed as limitingthe scope of the invention, as set forth in the claims.

FIG. 2 illustrates one embodiment of the database module 100. Thedatabase module includes a database 200 and a metadata repository 204 incommunication with parser 208 and metadata utilities 212. The database200 stores data, indices, logs, and static metadata for use by the othermodules in the system. The metadata repository 204 contains metadataassociated with the data in database 200. Typical metadata includes, forexample, defined keys, defined rules, and the dependencies betweenvarious fields in a pending query (such as sources, targets andmappings). In one embodiment, the metadata repository 208 is stored inan SQL server or other relational database, and can be accessed usingstandard SQL queries. In other embodiments, the metadata repository 208is stored in a vectorized database, such as that described below.Processes which provide system functionality can be controlled byvariables stored in metadata repository 204 whose values are set by auser via a user interface.

Database 200 stores the data used by the system, as well as the indicesand logs that are generated by the system for its internal use. Like atable in a traditional relational database, each table or “relation”stored in the database 200 is composed of columns and rows. Each rowentry in the table contains related data, also called a record, and eachcolumn in the table specifies an attribute of the record. Table data isstored column-wise, that is, as vectors with data fields stored linearlyas one large block of contiguous data (e.g., in a file). To facilitatevectorization, the length of the columns in the table is uniform, as isthe data type of the entries in a particular column.

This column-wise storage of data has numerous advantages. First, sincethe data in a column is stored contiguously, it can be read into memoryquickly, without seek time associated with reading data scattered acrossa disk. Second, since each data element in a column is of a fixed size,it is relatively simple, for example, to determine the location of thedata associated with a particular record as an offset from the start ofthe file containing the column. Likewise, it is straightforward to readselected portions of a column from disk into memory because thestructure of the data stored on the disk is the structure of the data asit is to be stored in memory. Thus, there is no need to allocate abuffer memory or perform any transformation between reading the datafrom the disk and storing the data in memory. The column-wise storage ofdata also facilitates random accesses into the data, since the datastored on disk can be traversed directly without any intermediaryloading or transformation into a memory.

In one embodiment, the system maintains range indices for each key fieldcolumn and index column stored in the database. For each fixed-size“chunk” of a column (or index), the range indices contain maximum andminimum values of the data in that chunk. This information can be usedto increase the efficiency of certain operations, such as table joins,searches, or identifying minimum or maximum values, without requiringsignificant additional storage relative to the size of a column.

For example, suppose that a column had 700 million entries, and that thechunk size for the column is 10,000 entries. The column would have arange index with 70,000 entries (which is small compared to the 700million entries of the column), with each entry storing the minimum andmaximum values in that chunk of the column. When a search is performedagainst the table looking for a particular value in that column, thedesired value can first be compared to the minimum and maximum valuesfor each chunk, and chunks for which the desired value does not fallwithin the range defined by the minimum and maximum values need not besearched, read, or accessed in any way.

The range indices may be updated as data are inserted, updated, anddeleted from the column. When the data are inserted, updated, or deletedduring transaction processing, updates to range indices are stored asedits in logs for the range indices. The use of logs to support changesto data and indices is also discussed below.

To reduce the user effort required to implement vector operations, inone embodiment the database system includes support for a functionalprogramming language. In this embodiment, the system is designed toaccept database queries in the language and, using a system of tailoreddata structures and overloaded operators, translate the queries intoefficient vector operations that are suitable for parallel processing.Other embodiments of the present invention may support traditional querylanguages such as SQL. In another embodiment, the entire system iswritten in the functional programming language, while in still otherembodiments the database system is written in a conventional programminglanguage such as C, C++, or C#.

Since the language is column oriented, the computations for each columncan be specified and performed independently. In contrast, languageslike SQL require that complex queries be created to deal with separateaggregations over many columns. For example, the following functionallanguage specification: Declare Maps (ConsolidatedPL.TickerSymbol,Positions.TickerSymbol; ConsolidatedPL. TraderID, Positions.TraderID)ConsolidatedPL.TickerSymbol = Positions.TickerSymbolConsolidatedPL.TraderID = Positions.TraderID ConsolidatedPL.IndustryName= Securities.IndustryName ConsolidatedPL.MarketValue =Positions.Holding * Securities.MarketPriceConsolidatedPL.TotalPLbyTrader = FINDSUMVAL ((Positions.Holding *(Securities.MarketPrice − Positions.PurchasePrice) +Settled.RealizedPL), ConsolidatedPL.TraderID)ConsolidatedPLTotalPLbyIndusiry = FINDSUMVAL ((Positions.Holding *(Securities.MarketPrice − Positions.PurchasePrice) +Settled.RealizedPL), ConsolidatedPL.IndustryName)

is equivalent to the following SQL code: WITH PL AS (SELECT P.TraderIDAS Trader, PTickerSymbol AS TickerSymbol, S.IndustryName AS Industry,S.MarketPrice * P.Holding AS MarketValue, P.Holding * (S.MarketPrice −P.PurchasePrice) AS UnrealizedGain, T.RealizedGain AS RealizedGain) FROM(Positions.P FULL OUTER JOIN Settled.T ON P.TraderID = T.TraderID ANDP.TickerSymbol = T.TickerSymbol) JOIN Securities.S ON P.TickerSymbol =S.TickerSymbol), ByTrader AS (SELECT Trader AS Trader,SUM(UnrealizedGain + RealizedGain) AS TotalPLbyTrader FROM PL GROUP BYTrader), ByIndustry AS (SELECT Industry AS Industry,SUM(UnrealizedGain + RealizedGain) AS TotalPLbyIndustry FROM PL GROUP BYIndustry), SELECT PL.Trader AS Trader,ID PL.TickerSymbol ASTickerSymbol, PL.MarketValue AS MarketValue, ByTrader.TotalPLbyTrader ASTotalPLbyTrader, ByIndustry.TotalPLbyIndustry AS TotalPLbyIndustry FROMPL, ByTrader, ByIndustry WHERE PL.Trader = ByTrader.Trader PL.Industry =ByIndustry.Industry

In one embodiment, the parser 208 accepts queries and transforms thequeries into directed acyclic graphs, where each node in the graphrepresents a data structure and each line in the graph represents anoperation on its connected data structures. At compile time, the parser208 can identify and eliminate circular dependencies to construct thegraphs and improved JOIN strategies. In one embodiment, the parser 208is itself written in the functional programming language.

The metadata utilities 212 provide a convenient point of access to themetadata contained in the metadata repository 204. The utilities, invarious embodiments, provide different functionality. For example, oneutility 212 ¹ exports metadata to computer-aided software engineering(CASE) tools. Another utility 212 ² draws graphs for a user using themetadata in the repository 204 to visually depict the metadata.

One embodiment of the user interface module 104 is presented in FIG. 3.In this embodiment, the user interface module 104 includes anapplication server 300 in communication with one or more executableprograms, such as OLAP GUI 304, system GUI 308, or web server 312. Inanother embodiment of the user interface module 104, the applicationserver 300 may be a gateway server and the gateway server, the OLAP GUI304 and System GUI 308 may be implemented in Java and packaged as Javaarchive (.JAR) files.

In operation, an end user interacts with the database system using a webbrowser (not shown). The browser connects with the web server 312, whichprovides the web browser with one or more files written in hypertextmarkup language (HTML). The files typically include a reference to alocation storing a program providing the functionality of OLAP GUImodule 308 or Modeller GUI 308. The user's computer downloads theprogram and executes it, permitting the user to manipulate the graphicalelements of the program. Manipulating the graphical elements sendsmessages to the application server 300 which in turn passes them to thedatabase system for processing. The application server 300 may alsoreceive completed results from the database server and provide them tothe GUI 304, 308. In one embodiment, the application server 300 alsoincludes configuration data, such as the location of the metadatarepository 204.

FIG. 4 illustrates one embodiment of the data generator 108 including adata generation server 400. In one embodiment, the data generationserver 400 generates test data sets based on a database model describedin the metadata repository 204. The data generation server 400 mayaccept inputs specifying a desired range of values and overlap and thengenerate a data set satisfying the specified values. With a test dataset of sufficient size, it is possible to explore the scalability of thedatabase model and perform quality assurance tests.

FIG. 5 illustrates one embodiment of the external interface module 124,which to facilitate interoperability with third-party software products.In one embodiment, the external interface module includes an opendatabase connectivity (ODBC) services module 500 and/or a common-objectmodel (COM) services module 504. The ODBC module 500 exposes distributedSQL and SQL ODBC interfaces to third-party products such as MICROSOFTEXCEL and MICROSOFT ACCESS. The COM services module 504 exposes a COMinterface for use with third-party software products that support COM.

Distributed Database Queries

In another embodiment, the ODBC services module 500 communicates with adistributed database system (not shown) for the evaluation of SQLqueries over a star schema. The distributed database system may beimplemented on a massively-parallel computer or a network ofworkstations having separate processors and memory. A master processseparates the tables in the database 200 into smaller data tables fordistribution and storage on individual computer nodes. The masterprocess also reduces database queries into subqueries on subsets of thetables and routes the appropriate subquery to the node having therelevant table subset. The node computes a partial result and returns itto the master process for aggregation and final assembly of the results,e.g., by processing the “group by,” “having,” and “order by” clauses ofthe SQL query.

In one embodiment, the database 200 is organized in a star schema withvery large fact tables and relatively smaller lookup tables. The numberof fact tables may be restricted to one fact table and, likewise, arestriction may be implemented prohibiting the joining of fact tablesfor the purpose of allowing specific optimizations. In this embodiment,a lookup table may, for example, be delegated to a computer node forprocessing.

The data in the tables in database 200 may be vertically partitioned,i.e., vectorized, facilitating the parallelization of database queryprocessing as described above. For example, many queries result in theexamination of all of the rows of a table and a subset of the table'scolumns when each column represents a field and each row represents anentry in the table. The data in the fact tables may also be horizontallypartitioned into sizes that are appropriate to the processing power ofthe individual computer nodes. In one embodiment, a user may manuallyspecify a minimum or maximum size for a partitioned table, or specify aload factor that expresses the computational capability, and thereforethe appropriate chunk size, of the computing node relative to the othernodes in the network.

In one embodiment where the distributed database system is implementedon a network of workstations, the system includes three sets offunctionality: a mechanism for controlling the processes executing onindividual workstations in the network, a mechanism for distributing thedata contained in fact tables to individual workstations, and a queryengine.

The processes on the individual workstations are controlled through auniversal interface. Each workstation in the network runs a process thatcan identify the available disk space or CPU time on the workstation,load new or replacement software, and start or stop individual softwareprograms. In a further embodiment, the process is fault-tolerant suchthat in the event of a system failure or restart, the process itselfautomatically restarts and resumes the execution of the software thatwas active prior to the failure.

The distribution of data is controlled by one or more data distributionservers. The data distribution server listens for updates from a datasource—e.g., a streaming source, packages the updates into a group,selects a target workstation to receive the update, publishes theupdate, waits for a response, and selects another workstation if thefirst one fails. Each workstation executes a data server that listensfor updates published by the data distribution servers; the data serversmake updates to the local, partial copies of the fact tables. In anotherembodiment, the data distribution server publishes updates to one ormore backup locations and tracks where updates have been distributed.

Queries to the distributed database system are handled by a query masterprocess that may be replicated on one or more workstations to achievefault tolerance. The query master breaks a query into subqueries,identifies the appropriate target workstations to execute thesubqueries, transmits the subqueries to the query server processes onthese machines, waits for responses from the query servers, andassembles the partial results into full results. The query servers uponreceiving the subqueries, pass the subqueries to query engine processesthat perform the actual execution of the subqueries. In one embodiment,where each workstation is a multiprocessor computer, the query engineprocess is adapted to utilize one or more of the processors in theworkstation. The query engines provide their results to the queryserver, which forwards the results to the query master. In some cases,the query master may assemble the results by a simple union of thepartial results, while in other cases, the results may be assembled bycombining the subqueries.

In another embodiment, the distributed database system also provides ainterface to the user for controlling the distribution of software tovarious workstations in the network, and for starting and stopping theexecution of software on those machines.

Referring to FIG. 6, in one embodiment the OLAP services module 112includes an OLAP server 600 in communication with a data mining module604 and a navigator module 608. The data mining module 604 supportscomplex queries intended to uncover information hidden in the data. Thenavigator module 608 permits a user to perform a “drill-down” inspectionon the data contained in database 200.

Referring to FIG. 7, in one embodiment the batch processor module 120includes a batch manager 700 in communication with a batch metadataextractor 704, a heterogeneous transaction manager propagator 708, oneor more source manager(s), generically 712, and one or more targetmanager(s), generically 716.

The source managers 712 receive data from one or more sources andconverts the data to an internal database format called a delta ord-table, also discussed below. In one embodiment, each source manager712 is adapted to receive data from a particular data source.

The batch metadata extractor 704 retrieves computational rules from themetadata repository 204 for processing by the batch manager 704. Thebatch manager receives the rules and the delta tables and applies themto construct relations of derived values called gammas or g-tables in aheterogeneous, vectorized format.

Transaction Processing Subsystem

Referring to FIG. 8, the structure of an embodiment of a transactionprocessing subsystem in accordance with an embodiment of the presentinvention is described. Transaction processing subsystem 800 includesvarious processes including the transaction manager 802, which isgenerally the first of these processes to be started when transactionprocessing begins. Transaction manager 802 is responsible for startingall other required processes in transaction subsystem 800, for managingthe state of transaction processing subsystem 800 (described below), andfor assigning work to the individual processes of the transactionsubsystem.

Adapter manager 806 receives information on transaction sources fromtransaction manager 802, and sets up an adapter listener 808 for eachsuch source. When adapter listeners 808 are started, adapter manager 806informs transaction manager 802 that the adapter listeners 808 areready.

Each adapter listener 808 receives data from a queued transaction datasource, such as Tibco, MQ Series, Tandem, Aleri IPC, or other datasources, databases, or systems that are capable of providing transactioninformation. Each adapter listener receives data from one or more ofthese sources, converts the data into the form that is used to representtransactions within transaction subsystem 800, forms blocks of thetransaction data, and forwards the blocks of transaction data to aresource manager 804. In various embodiments, it is possible for alistener 808 to receive data from multiple sources, and for multiplelisteners 808 to be use to receive data from multiple sources.

The size of the blocks of transaction data that are sent from an adapterlistener 808 to a resource manager 804 can be an adjustable parameterthat is “tuned” according to hardware and software configuration. Anadapter listener 808 may collect transactions until a particularpredetermined number of transactions are collected, and then send ablock of transaction data to resource manager 804 and/or an adapterlistener 808 may send all the transactions that are collected during aspecified period of time. For example, an adapter listener could be setto send transactions when it has collected 200 transactions or when onesecond has elapsed since the last time that is sent a block oftransactions, whichever occurs first.,

Each resource manager 804 is responsible for collecting transactions andfor logging collected transaction data for one source table of“transaction” type. Transaction manager 802 starts one resource manager804 for each such transaction source table. Generally, each resourcemanager receives blocks of transactions from an adapter listener 808,synchronously stores the transactions to disk, and sends notificationthat the blocks have been stored to transaction manager 802, along withinformation on where the blocks of transactions can be retrieved forfurther processing. Once this information has been sent, resourcemanager 804 can continue its operation in a non-blocking manner.

Each minimal recalculation engine 810 is responsible for taking a set oftransactions, typically in the form of edits to tables, and “running” atleast a portion of a model on the edits. The transaction manager 802 maystart one or more minimal recalculation engines 810 to runsimultaneously. The number of minimal recalculation engines 810 that maybe started at once typically is a configurable parameter of transactionmanager 802. Transaction manager 802 manages these minimal recalculationengines 810 so that calculation of a model is parallelized at the tablelevel. Each minimal recalculation engine 810 operates on indices andlogs to recalculate portions of a model that are affected by thetransactions, rather than recalculating the entire model. Each minimalrecalculation engine 810 notifies the transaction manager 802 when it isfinished with intermediate results, so that the transaction manager 802may determine which portions of the model can be computed given theavailable sources, intermediate results, and system resources, and sendthose portions to one or more minimal recalculation engines 810.

Internal checkpoint manager 812 is responsible for the non-blockingtwo-phase auto-commit type check point functionality required by minimalrecalculation engines 810. Typically, an internal checkpoint manager 812and one or more slave processes are started by transaction manager 802when transaction subsystem 800 is started.

External checkpoint manager 814 is responsible for publishing changes toselected targets, and to other applications and processes. Typically,one external checkpoint manager 814 is started by transaction manager802 when transaction subsystem 800 is started.

System Operation

In operation, database transactions may result in the verticalpartitioning of individual columns from the database 200 into vectordata structures that are themselves stored as individual files in thedatabase 200. These vectors may be further subdivided and farmed out toindividual processors or workstations for processing. Vectorizationfacilitates parallel processing because typically the same operationwill be performed on every data entry in the vector. Moreover, avectorized or subdivided data set is more likely to fit in thehigh-speed cache memory of a processor, reducing the need for externalmemory accesses, and more likely to fit in the main memory of a computersystem, reducing the need for disk accesses that can slow downprocessing.

Batch Processor Operation

Referring to FIG. 9, using the System GUI, a traditional text-basedinterface, or other equivalent interfaces, a user may specify a datamodel and a set of computational rules accompanying the model (Step900). The model sets out such elements as source tables, target tables,table keys, mappings between tables, and rules. The system uses thisspecification to construct a series of operations, e.g., joins andaggregations, that can be used to transform the input tables into thetarget model (Step 904). Having established the set of operations, thesystem executes them and provides the results to the user. In certainembodiments, the results are updated in real-time with the results ofnewly-processed transactions (Step 908).

The system associates every table with a unique table identifier, i.e.,an index such as a numerical value, by automatically constructing ananalyzing a graph structure. Similarly, each field in each table isassociated with a unique field identifier, i.e., a second index.

The relations that serve as inputs are referred to as d-tables ordeltas. A d-table typically includes one or more fields derived from anexternal source, such as a Sybase table, a text file, or a vectorizedtable native to the system. In some embodiments, one or more of thefields in the d-table may be computed fields.

Intermediary, derived tables are referred to as g-tables or gammas. Thefields in a g-table are derived from the rules and one or more tablesthat the user has previously specified. The input tables I that form aparticular g-table G, i.e., the set of relations I(G), are the tableswhose fields are referenced in the computational rules which define G.For example, if G has one field which is computed from fields containedin tables E and F, then it is true that I(G)={E, F}. The combination ofd-tables, g-tables, and tables derived from g-tables automaticallydefine an internal graph structure.

Every table T has one or more key fields K(T) that uniquely identify therecords in T. For example, assume a table of employees T1: SSN NameDepartment Salary 100-01-0000 Bob Sales 100 020-02-9999 Bob Development200 007-87-1523 Ken Marketing 150 976-81-1829 Joan Sales 300 172-67-9163Brenda Development 80 658-17-8743 Dennis Development 300

In this example, one possible K(T1) is the employee's social securitynumber. Similarly, assuming that no person works for more than onedepartment, a second possible K(T1) is the pair of columns (Name,Department). The database system maintains a key table in database 200containing the keys for all of the tables used in the graph.

The input tables I(G) for a particular gamma are related by a mappingsbetween the individual input tables. Referring to the employee table T1,assume a table T2 identifying the managers of the departments specifiedin T1: Department Name Sales Joan Development Dennis

and a table identifying the locations of each department: DepartmentCity Sales Paris Development Santa Fe Research Sausalito

The user may define a mapping: Map(Employee.Department,Manager.Department). This indicates that the entries in these particularcolumns are drawn from the same set of data. The Map operator isreflexive, i.e., Map(Manager.Department, Employee.Department) isequivalent to Map(Employee.Department, Manager.Department). The Mapoperator is also transitive, in that defining anotherMap(Employee.Department, Location.Department) permits the system toinfer that Map(Manager.Department, Location.Department). Formallyspeaking, a mapping M exists between two fields F and G if there is apath H[1], . . . , H[n] with F=H[1], H[n]=G, and either n=1 and F=G, orn>1 and for each i between 1 and (n−1), either Map(H[i], H[i+1]) orMap(H[i+1], H[i]) is a mapping declared by the user.

As mentioned above, the user may also specify computational rules to,e.g., create column entries in a gamma. For example, to derive a g-tableT3 containing each employees' manager, a user may defineMap(EmployeeManager.SSN, Employee.SSN), indicating that the entries inthe SSN column in the EmployeeManager relation must be drawn from theentries in the SSN column of the Employee relation. Then, the user couldspecify computational rules, such as:EmployeeManager.SSN:=Employee.SSNEmployeeManager.Name:=Employee.NameEmployeeManager.ManagerName:=Manager.Name

When these mappings and rules are compiled, the system attempts to usevarious types of table joins (e.g. Fuse, Link, and Loop joins, describedbelow) to combine tables as necessary to enable the calculation of thespecified rules. For example, the result of the example rules specifiedabove would be: SSN Name ManagerName 100-01-0000 Bob Joan 020-02-9999Bob Dennis 007-87-1523 Ken <null> 976-81-1829 Joan Joan 172-67-9163Brenda Dennis 658-17-8743 Dennis Dennis

The system stores a table map in database 200 that contains all of themappings between all of the tables in the graph and has the structure: tu M(t, u) M(u, t) - - —----- —----- . . . . . . . . . . . . T U f, . . ., g h, . . . , i . . . . . . . . . . . .

Referring to FIG. 10, the processing of each g-table G occurs in sixsteps. First, all of the input tables I(G) are combined into a singletable S(G) (Step 1000). Next, S(G) is exploded into a single table ES(G)(Step 1004). Duplicate fields in ES(G) are discarded to form tableUES(G) and a synonym table Z(G) is constructed from ES(G) (Step 1008).UES(G) is partitioned into PUES(G) (Step 1012), and G is constructed byapplying the computational rules to PUES(G) and Z(G) (Step 1016).Lastly, the selection proposition is applied to G (Step 1020).

Given the input tables I(G), the step of combining I(G) into S(G) (Step1000) can be achieved by first applying fuse, link, and loop operationsaccording to the algorithm described below.

Having completed the combination step (Step 1000), the resulting tableS(G) contains all of the fields in the input tables I(G), which mayresult include one or more duplicate fields. Duplicate fields will beduplicated when the S(G) is Exploded (Step 1004), resulting in anexploded table ES(G) containing fields with identical values.

The Explode operator replicates the rows of a table to accommodate therearrangement of the rows of a subset of the table fields into a singlecolumn. Typically, the Explode operator includes the following steps:

-   -   a. The columns on which explode is done get eliminated.    -   b. The rows get replicated as many times as the number of        columns on which the explosion is done.    -   c. Two new Columns get added to the result: “Field Name”, “Field        Value”. (The user may choose different column names).    -   d. The “Field Name” columns become the additional field of the        key.

All the values in the columns on which the Explode is done are being putinto the “Field Value” column so that the values in the “Field Value”columns correspond the respective column names in the “Field Name”column, based on which columns the values are taken from. (During thisoperation all the values, which go into the “Field Value” column getconverted to character, if they are of different data type, otherwisethe datatype is preserved). The following example illustrates theExplode operator (key fields are marked by “” K1* K2* F1 F2 F3 F4 A 1New York 10 10^(th) Avenue 555-55-55 A 2 Chicago 20 Shore Drive222-56-78 A 3 New York 30 5^(th) Avenue 555-12-12 B 1 Paris 10 Place DeLa 123-45-67 Concorde B 2 London 11 Trafalgar 999-99-99 Square B 3Moscow 12 Red Square 444-44-44 C 0 St. 13 Palace Square 111-11-11Petersburg

Assume that the table is exploded on the fields F2,F3, and F4, producingthe following result: K1* K1* Field Name* Field Vlaue F1 A 1 F2 10 NewYork A 1 F3 10^(th) Avenue New York A 1 F4 555-55-55 New York A 2 F2 20Chicago A 2 F3 Shore Drive Chicago A 2 F4 222-56-78 Chicago A 3 F2 30New York A 3 F3 5^(th) Avenue New York A 3 F4 555-12-12 New York B 1 F210 Paris B 1 F3 Place De La Concorde Paris B 1 F4 123-45-67 Paris B 2 F211 London B 2 F3 Trafalgar Square London B 2 F4 999-99-99 London B 3 F212 Moscow B 3 F3 Red Square Moscow B 3 F4 444-44-44 Moscow C 0 F2 13 St.Petersburg C 0 F3 Palace Square St. Petersburg C 0 F4 111-11-11 St.Petersburg

The Implode operator is the opposite of the Explode operator. The sameexample above could be used to illustrate an Implode, except the sourceof Implode would be the second table. The Implode operator includes thefollowing steps:

-   -   a. One key column is chosen as the fields names source.    -   b. One column is chosen as a fields values source.    -   c. As a result of elimination of the two columns, the remaining        table should have rows duplicated, as many times as the number        of unique values in the first column, otherwise implosion is        impossible. Producing a set of distinct rows eliminates the        duplication.    -   d. The new key is the previous key without column in (1).    -   e. As many columns as the number of distinct values in the        column in (1) gets added to the result of (3), the names of the        columns being the values of the column in (1).    -   f. The values of the columns are populated from the column in        (2), so that the correspondence between the columns names and        the values is the same as the correspondence between the values        in the columns (1) and (2), row-wise, and the correspondence        between the remaining key and the values in the column (2) is        preserved.

The next step in the processing of the g-table is to construct adictionary Z(G) and a reduced table UES(G) (Step 1008). Equivalenceclasses of fields are formed, for example, (f, . . . , g), (h, . . . ,i), etc. The first field in the group—here f or h—is selected torepresent all of the fields in the group. The remaining fields in eachgroup are deleted from ES(G), leaving only the representative fields inthe result table UES(G).

The dictionary Z(G) is constructed with variables whose names match theoriginal fields of ES(G): Z.f:′f, . . . , Z.g:′f, Z.h:′h, . . . ,Z.i:'h, etc. The value of Z.x is the name of the representative field ofthe group to which X belongs. The dictionary Z(G) improves performanceby reducing the partitioning of multiple copies of the same data and isused for logical reasons to ensure that the name of every field in everyinput table I(G) is recognized as it appears in the rules for G.

After construction of the dictionary, the reduced table is partitioned(Step 1012) into relation PUES(G). Using the table map for I(G) and G,one can determine whether K(G) is also a key for UES(G). If K(G) is nota key for UES(G), we can select the partitioning to force K(G) to be akey for UES(G) by partitioning UES(G) on K(G), converting the remainingfields of UES(G) into lists of vectors.

The next step in table processing defines each field in G by evaluatingthe rule defining that field over PUES(G) (Step 1016). Since PUES(G) isa partitioned table, the execution of the rules results in aggregatedfields, i.e., vector fields, and not fields which are lists of vectors.In one embodiment, if the execution of a rule results in an atomicvalue, then that value is reshaped to a vector with a length equal tothe cardinality of G whose entries are the atomic value, repeated. Inanother embodiment, if the execution of a rule results in a value whichis a list of vectors, then a run-time “depth” error is signalled. In yetanother embodiment, this latter error is detected at compile time by asemantic analyzer contained in system GUI 308.

First, the rules for K(G) (the keys of G) are evaluated. If theresulting table contains empty records, then these empty records areremoved from both G and PUES(G). Then the rules for the remaining tablesare ordered based on dependency relations among the fields they define.For example, if a value G.gross is defined to be G.price*G.quantity,then G.quantity and G.price are defined before G.gross is computed.Dependency cycles in the rules are not allowed. Lastly, the selectionproposition Y is applied to G (Step 1020), keeping only those records inG which satisfy Y. The result is saved in the database 200, with eachfile corresponding to an individual vector, and can be used as input toadditional g-tables.

FIG. 11 is a flowchart illustrating the algorithm for computing theunderlying expressions representing the combined relation specified bythe user. First, the source relations are determined (Step 1100). Next,a series of steps is determined for computing a relationship betweentables (Step 1104). Lastly, an algorithm is iteratively applied to thesource relations to build an expression using the fuse, link, and loopoperators (Step 1108).

Given a table computed by rules, the source relations can be determined(Step 1100) by examining the rules and identifying all of columns usedby the rules. For example, in a rule using the dotted identifierEmployee.SSN the column is “SSN” and “Employee” is the relation. Moreprecisely, suppose the rules for a particular derived relation T havethe form: T.fl := <expression l> ... T.fn := <expression n>and the fields occurring in <expression 1>, . . . , <expression n> areS1.g1, . . . , Sm.gm. Then, the collection of source relations is theset of distinct identifiers in the collection S1, . . . , Sm.

Having identified the source relations, it is now possible to deduce,for any pair of relations, one of four possible relationships: “1 to 1”,“1 to M”, “M to 1”, and “M to M” denoting, respectively, one-to-one,one-to-many, many-to-one, and many-to-many relationships (Step 1104).First a collection of sets is initialized. Given the complete set ofmappings between all relations (not only source relations):M={Map(s1.f1,t1.g1), . . . , Map(sn.fn,tn.gnp)}

It is possible to identify the set of sets of column E using thefollowing algorithm. First, initialize E to be the set {{s.f}|s.f is acolumn in one of the tables}. Thus, E is a set of singleton sets, eachhaving one field. Also, set N to M.

Next, assume that the first element in N is Map(e1, e2), where e1 and e2are columns. Then, let K1 be the element of E such that e1 is an elementof K1; and K2 be the element of E such that e2 is an element of K2. SetE=E−{K1, K2} union {K1 union K2} and N=N−{Map(e1,e2)}. Afterwards, if Nis empty, then stop the process. Otherwise, process the next element inN.

As an example, consider the previously-defined Employee, Manager, andLocation relations with the mappings:Map(Employee.Department,Manager.Department)Map(Manager.Department,Location.Department)

To calculate the collection E of sets, E is initialized to:E={{Employee.SSN},{Employee.Name},{Employee.Department},{Employee.Salary},{Manager.Department},{Manager.Name},{Location.Department},{Location.City}}

In light of the first Map, E becomes:E={{Employee.SSN},{Employee.Name},{Employee.Department,Manager.Department},{Employee.Salary},{Manager.Name},{Location.Department},{Location.City}}and after the second Map, E becomesE={{Employee.SSN},{Employee.Name},{Employee.Department,Manager.Department,Location.Department},{Employee.Salary},{Manager.Name},{Location.City}}

Thus, the algorithm forces Employee.Department and Location.Departmentto be related because they end up in the same set or equivalence class.

A fuse is a one-to-one join, also called a “full outerjoin” inrelational databases. The operation in expressions is “Fuse”. Forinstance, the expression:Fuse(Manager,Location)

represents the full outer join of the Manager and Location relations,where the join is based on the key columns Manager.Department andLocation.Department. Given the data above, this relation is: DepartmentName City Sales Joan Paris Development Dennis Santa Fe Research <null>Sausalito

There may be more than two relations in a Fuse. The key field of a Fuseis the key field of any of the relations; by convention, we set it to bethe key field of the last relation in the Fuse expression. Thus, the keyfield for the Fuse above is the Location.Department column.

A link is a one-to-many join. The operation in expressions is “Link”.There may be more than one relation that is linked “one” to the “many”relation. The “many” relation is called the driver relation, and appearslast in the Link expression. For instance, in the expression:Link(Manager,Employee)

the driver relation is Employee, which is in a many-to-one relation withManager (each row in Manager matches possibly many rows in Employee,using the key columns). The relation computed by this link is, given theabove data: SSN Name Salary Manager.Name 100-01-0000 Bob 100 Joan020-02-9999 Bob 200 Dennis 007-87-1523 Ken 150 <null> 976-81-1829 Joan300 Joan 172-67-9163 Brenda 80 Dennis 658-17-8743 Dennis 300 Dennis

Similarly, the expression:Link(Manager,Location,Employee)specifies Employee as the driver relation, and returns a relation withone more column (the City column). Another similar expression is:Link(Fuse(Manager,Location),Employee)where again Employee is the driver relation. The key field of a Link isthe key field of the driver relation.

Finally, a loop is a many-to-many join. The operation in expressions is“Loop”. Loop expressions involve only two relations, e.g., Loop(Relation1, Relation2). The key field of a Loop is the union of the keyfields of the two constituent relations, with duplicate mapped keysremoved from this set.

The subroutine for deducing the relationship between S and R, for anyrelations S and R, uses this set E={E[1], . . . , E[m]} using thefollowing steps. First, the key fields of S and R are identified. Forexample, assume that they are either declared or computed to be S.a1, .. . , S.am and R.b1, . . . , R.bp. Using these key fields, determine thesets K and L such that:K={u|if S.Bv is in E[u] for some v and u, then some column of R is alsoin E[u]}L={u|if R.bv is in E[u] for some v and u, then some column of S is alsoin E[u]}

After this computation, if K=L, then the relation between the sourcerelations is “1 to 1,” i.e., a one-to-one relationship. If L is a propersubset of K, then the relation is a “1 to M”, i.e., a one-to-many,relationship. If K is a proper subset of L, then the relation is a “M to1”, i.e., a many-to-one relationship. If K and L are incomparable assets, then the relation is a “M to M”, i.e., a many-to-manyrelationship.

Again, for example, consider the set E constructed above and assume thatit is desirable to deduce the relationship between the relationsEmployee and Manager. The key field of Employee is Employee.SSN and thekey field of Manager is Manager.Department. Following the processoutlined above, Employee.SSN is in the first set of E andManager.Department is in the third set of E. Thus K={ } and L={3} and Kis a proper subset of L, so there is a “M to 1” relationship betweenEmployee and Manager.

For another example, assume that it is desired to deduce therelationship between Manager and Location. The key fields areManager.Department and Location.Department. Thus, K={3} and L={3}, sothere is a “1 to 1” relationship between Manager and Location.

Once the relations are deduced (Step 1104), an iterative algorithm isapplied to the source relations to build an expression using the fuse,link, and loop operators (Step 1108). The algorithm is illustrated inFIG. 12 and begins with a set C={S1, . . . , Sm} of distinct identifiers(Step 1200). Without loss of generality, the identifiers can be assumedto be distinct since they are contained in a set.

First, assuming that C={T1, . . . , Tk}, then set D=C and E to be theempty set. Defining the first element in D to be V, then solve for allrelations in D−{V} that are “1 to 1” related to V, using the algorithmin Step 1104 (Step 1204). If there are no such relations, set E=E union{V} and D=D−{V}. Assuming that there are “1 to 1” relations U1, . . . ,Ur, then set E=E union {Fuse(U1, . . . , Ur, V)} and D=D−{U1, . . . ,Ur, V}. If D is not the empty set, solve for the “1 to 1” relations forthe next entry in D, otherwise set C to E, and proceed to the next step.

Again assuming that C={T1, . . . , Tk}, then set D=C and E to the emptyset. Define the first element in D to be V. Now, find all relations inD−{V} that are “1 to M” related to V, using the algorithm of Step 1104(Step 1208). If there are no such relations, set E=E union {V} andD=D−{V}. For each “1 to M” relationship U1, . . . , Ur, set D=D−{U1, . .. , Ur, V} union {Link(U1, . . . , Ur, V)} union E and E to the emptyset. If D is not the empty set, then solve for the “1 to M” relationsfor the next element in V, otherwise set C to E and proceed to the nextstep.

Again assuming that C={T1, . . . , Tk}, then set D=C and E to the emptyset. Define the first element in D to be V. Now, find all relations inD−{V} that are “M to M” related to V, using the algorithm of Step 1104(Step 1212). If there are no such relations, set E=E union {V} andD=D−{V}. For each “M to M” relationship U1, . . . , Ur, set D=D−{U, V}union {Loop(U, V)} union E and E to the empty set. If D is not the emptyset, then solve for the “M to M” relations for the next element in V,otherwise set C to E and proceed to the next step.

If, after solving for the “1 to 1”, “1 to M”, and “M to M” relations,the set C has not changed or has only one entry, then stop, otherwisethe algorithm repeats (Step 1216). If at the end C has more than oneelement, an error of “not enough mappings” is returned as the result;otherwise, the one element of C is returned.

Other embodiments of this algorithm detect anomalous or erroneoussituations. For example, in one embodiment if the user defines a Mapbetween a non-key in one relation and a non-key in another relation, an“exaggerated mapping error” is reported. In another embodiment, if therelationship between the combined relation and the target relation is “Mto M” or “1 to M”, as computed using the algorithm in Step 1104, then anerror is reported. In still another embodiment, if the relationshipbetween the combined relation and the target relation is “M to 1”, thenthe Modeler sets the aggregation flag for the target relation. Also, ifany of the rules for the key columns in the target relation are notsimple field expressions-that is, the rules do not have the formt.f:=s.gwhere t is the target relation, f is the column in the target relation,s is the source relation, and g is the column in the source relation,the Modeler sets the aggregation flag for the target relation. This kindof rule is called a “key transformation.”

The aggregation flag implies further checking. If the aggregation flagis set, the computational rules for the target relation are checked.These rules must have enough “aggregation” on non-key fields in order toensure that the target relation can be constructed.

Operation of Transaction Processing

The rules just described can be used throughout the system, and areparticularly useful in enabling transaction processing. Bypredetermining the relationships between the tables desired for use inqueries, for example, the transaction processing subsystem can calculatethe table changes required by transactions as transactions come into thesystem, even if the changes are to be reflected in the arbitrarilycomplex table relationships. This is demonstrated by the operation oftransaction processing.

During transaction processing, a series of transactions affect thesource tables (also called d-tables or delta tables), making changes tothe data in those tables. When propagated through the model, thesechanges to the source tables may cause changes in the target tables. Thedatabase system of the present invention is capable of quickly handlingmost transactions, so that analysis based on a combination of historicaldata combined with incoming transactions is possible.

As discussed hereinabove, in operation, one or more adapter listeners808 receive incoming transactions from a variety of sources. Thesetransactions are collected into batches, and sent on to a resourcemanager 804, which stores the batches of transactions, and sends thebatches of transactions on to transaction manager 802.

Transaction manager 802 implements a state machine that keeps track ofthe batches of transactions that are received, groups the batches oftransactions for processing, and sends the batches of transactions toone or more minimal recalculation engines 810 for processing. The statusof the processing of batches of transactions is tracked, so transactionmanager 802 can determine when to send further batches of transactionsto minimal recalculation engine(s) 810, and so that transactionprocessing can be resumed if an interruption occurs.

Referring now to FIGS. 13A-13B, an example of the state information keptby transaction manager 802 is described. Each row in state table 1302includes a batch ID field 1304, that contains a unique batch ID of eachbatch of transactions to be processed, a NumRecs field 1306, thatcontains the number of transaction records contained in a particularbatch of transactions, a state field 1308, that keeps track of the stateof each batch of transactions, and a group ID field 1310, that keepstrack of which group of batches of transactions are to be processed atonce. In addition to these fields one embodiment of the presentinvention includes timing fields (not shown) with the state information.The timing fields may include fields for each group of batches oftransactions that track the amount elapsed time since the group wasformed, the amount of time spent calculating, and the amount of timespent checkpointing. This timing information may optionally be used forfine-grain tuning of the system, and for optimization heuristics ondirected graphs.

An entry in state field 1308 may contain the value “done”, whichindicates that the batch of transactions has been processed, “proc”,which indicates that the batch of transactions is being processed,“wait”, which indicates that the batch of transactions is waiting to beprocessed, “checkpoint1”, which indicates that the batch of transactionsis in a phase of checkpointing in which a rename operation is occurring,or “checkpoint2”, which indicates that the batch of transactions is in anon-blocking phase of checkpointing, in which logs are applied to abackup or “shadow” copy of the tables and indices.

The checkpointing will be described in detail hereinbelow, withreference to FIG. 25. During the phase of checkpointing during which therename operation is being performed (“checkpoint”), the minimalrecalculation engine(s) 810 are unable to start processing a new groupof batches of transactions. When a group of batches of transactionsreaches the “checkpoint2” state, the minimal recalculation engine(s) 810can start processing a new group of batches of transactions.

In the example state table shown in FIG. 13A, the transaction batcheswith IDs 1 and 2 are “done”, the transactions batches with IDs 3, 4 and5 are in the “checkpoint1” state, and the transaction batches with IDs6, 7 and 8 are waiting to be processed. This state information is usedby transaction manager 802 during its operation, and may assist inrecovering the state of transaction manager 802 if transactionprocessing is interrupted.

The entries in group ID field 1310 are used by transaction processor 802to identify groups of batches to be processed at once. In this example,because the group ID is 2 for the batches of transactions with batch IDs3, 4, and 5, these batches of transactions were sent to a minimalrecalculation engine 810 as one large group of transactions.

In FIG. 13B, state table 1302 is shown a short time later. Now, thetransactions batches with IDs 3, 4 and 5 are in the “checkpoint2“state,which is non-blocking. This permits transaction manager 802 to assignminimal recalculation engine(s) 810 to process the next group of batchesof transactions. Accordingly, the batches with batch IDs 6, 7 and 8, allof which are processed as a group (with group ID 3), are now beingprocessed, and have value of “proc” in state field 1308.

Referring to FIG. 14, an example dependency graph of tables is shown. Itwill be understood that this example graph is an illustrative example,and that the actual dependency graphs handled by the system generallyare more complex.

In the example, table T3 1402 is dependent on table T1 1404. Table T171406 is dependent on table T2 1408. Table T22 1410 is dependent on tableT3 1402, and T19 1412 is dependent on tables T3 1402 and T17 1406. Dueto these dependencies, transaction manager 802 can assign minimalrecalculation engines to work on transactions affecting tables T1 1404and T2 1408 in parallel, since neither depends on the other. No othertables may be computed until at least one of these tables has finished,even if more than two minimal recalculation engines 810 are available.

Once T1 1404 is finished, T3 1402 can be started in parallel with otherminimal recalculation tasks that are being handled. T19 1412 cannot bestarted until both T3 1402 and T17 1406 have completed. If there aremore tables ready to be processed than there are minimal recalculationengines 810 available to process them, the tables that are ready may beassigned arbitrarily to minimal recalculation engines 810.Alternatively, techniques such as use of weighted graphs can be used toplace an order on the assignment of tables that are ready to minimalrecalculation engines 810, to provide greater throughput.

The dependencies which are used to order and parallelize computation areavailable in the metadata, and are based on the mappings that weresubmitted by the user. Such parallelization, based on the availabilityof needed data (i.e. data flow) is also possible at the field level.

When the changes to a target table have been determined by minimalrecalculation engine(s) 810, those changes may be applied in anon-blocking manner by internal checkpoint manager 812, and published toapplications that “subscribe” to the target tables by externalcheckpoint manager 814.

Operation of the Minimal Recalculation Engine

Minimal recalculation engine 810 handles recalculation of target tablesthat are specified in the model when transactions change the sourcetables on which the target tables depend. The minimal recalculationengine makes use of the table rules that determine the manner in whichsource tables are joined to produce target table, and the application offield rules that compute the values of fields based on the values ofother fields.

As described above, relations between tables can be one-to-one,one-to-many, or many-to-many. Each of these relations may be representedby creating a combined table from tables in which the related fields arelocated. The operation used in a table rule to combine tables depends onthe nature of the relation. For example, for a one-to-one relation, afuse operation (also called a full outer join) is used to combinetables. For a one-to-many relation, a link operation (also called a leftor right outer join) is used, while for a many-to-many relation a loopoperation is used.

The combined table discussed above could be virtual (i.e., representedthrough sets of indices showing which row in each table involved in atable rule contributes to which row of the combined table). There is oneindex per table, the length of which is the number of rows in thecombined table. Each entry in the index corresponds to a row of thecombined table, and contains the number of the row of the table itrepresents that contributed to that row of the combined table.Alternatively, if the memory model permits, the combined table may bephysically produced.

As transactions enter the system, the transactions will affect thevalues in tables, which in turn, will affect the values in, and possiblythe dimensions of, other tables that are defined in the model to dependfrom the modified tables, for example as the result of fuse, link, andloop operations or other functions. The tables that are subject to theseoperations may be very large. In addition, models of even moderatecomplexity may involve application of table rules that call for numerousfuse, link, and loop operations to be applied to join the various sourcetables into a target table. If each transaction that alters a sourcetable resulted in complete recalculation of all tables that are definedin the model to depend on tables modified by the transaction, it wouldnot be possible to handle transactions very quickly. To provide therapid updates that are required for decision support based on incomingtransactions, it may be necessary, for example, to handle hundreds orthousands of transactions per second.

Minimal recalculation engine 810, performs only a relatively smallnumber of calculations for each received transaction. Minimalrecalculation engine 810 uses indices, and logs of changes to theindices to propagate transaction changes made to a “source” table usinga complex table rule in a model without performing a completerecalculation. For each transaction that changes a source table, arelatively small number of changes may be needed to indices thatrepresent combined tables resulting from fuse, link, and loop operationscalled for in the model. Minimal recalculation engine 810 computes thechanges to the indices and to the combined table (rather thanrecomputing the entire tables), and creates logs of these changes, whichmay be applied (asynchronously) by the checkpoint managers 812 oftransaction processing subsystem 800.

As is shown in the following examples, the fuse, link, and loopoperations of minimal recalculation engine 810 create logs of changes toindices that represent the combined tables, rather than recreating theactual combined tables. This reduces the required computation, since itis more efficient to add an entry to a log of an index than to build anentire table, particularly when the tables contain a large number (e.g.millions) of entries. In addition, since application of a rule in amodel may not require the entire combined table in order to generate thechanges to its target that result from a transaction changing a sourcetable, the minimal recalculation engine can use only the necessaryportions of a source or combined table that are needed to computechanges to target tables.

In the following examples, transactions will generally be referred to as“edits” to a source table. The edits are found in a log for the sourcetable, and can indicate that inserts, updates, or deletes should beapplied to the source table. As described above, in one embodiment, theedits are reviewed by a listener and provided to the minimalrecalculation engine 810 by the resource manager 804. In general, anedit comprises an operation (e.g. insert, update, or delete), an index,which indicates the position (counting from 0) in the table or index atwhich the edit is applied, and zero or more data fields, indicating thedata that is to be inserted or updated at the position indicated by theindex. An insert edit (denoted with an “i” in logs shown in examples)indicates that an item should be inserted into a table at a positionspecified by the index. An update edit (denoted with a “u” in the logsshown in the examples) indicates that one or more values of the datafields of a row indicated by the index are to be changed. A delete edit(denoted with a “d” in the logs shown in the examples) indicates thatthe row specified in the index is to be deleted. In response to theseedits to a source table, minimal recalculation engine 810 generatesedits to the indices associated with the source tables, and the combinedtables.

In one embodiment, delete edits are transformed into update edits. A“physical” delete, that actually removes the row from the table, istransformed into two updates. The first update moves the data fields ofthe last row of the table to the position of the row that is beingdeleted. The second edit changes the data fields of the last row of thetable to “null” values. Such null rows can easily be truncated from theend of the table. This approach to deleting rows ensures that null rowsfrom deletes accumulate at the bottoms of all the tables and indices,permitting a very inexpensive truncation operation to be used to removethe null entries at the ends of the tables. Moving the last row of thetable into the position of the deleted entry prevents the system fromhaving to perform the potentially expensive operation of moving all rowsbelow the deleted entry up by one row.

A second type of delete, referred to as a “logical” delete may also beused. A “logical” delete is transformed into a single update, thatchanges the values all of the non-key fields of the deleted row to“null”. Thus, using a “logical” delete, the deleted row remains in thetable, but its non-key data are removed. Alternatively, a “logical”delete may be handled by the system as if it were an update that changesthe values of all non-key fields in the row to “null”, without actuallytransforming the operation into an update.

In one embodiment of the database system, the type of delete that isused is an adjustable parameter, permitting users to specify whetherdeletes are to be handled as “physical” deletes or as “logical” deletes.

The order in which edits are processed may also vary. In one embodimentof the system, edits have different priorities. For example, deletescould be given the highest priority, so all deletes are handled first.Updates have the second highest priority, so all updates are handlednext. Inserts have the lowest priority, so they are handled last. Thesepriorities may vary, and may be specified in the metadata. If themetadata does not specify priorities, then all operations may be givenequal priority, so that edits will be handled in the order that they arereceived.

In an embodiment that uses such priorities, edits can be handleddifferently, or even ignored based on their priority, and on the otheredits that are in a log. The following procedure may optionally beapplied to handle prioritized edits. Note that this procedure assumesthat delete edits are handled as “logical” deletes, and are directlyprocessed by the system, rather than first being transformed. First, asearch is performed for each new edit, based on its key. If the edit isnot found in the table, the edit is transformed into an insert. Notethat this generally only affects updates for which a key is not found.If the key is found, the priority for the edit is compared to thepriority of the match that was found (assuming that the match was foundin the logs). If the priority is greater than or equal to the match,then the edit is applied without a change (though a “duplicate key”warning may be issued in the case of insert and delete edits). If thepriority of the edit is less than the priority of the matched element,then the edit is ignored.

This optional procedure may be useful in cases where the order of editsreceived by the system is not the same as the order in which thetransactions occurred. By using priorities in conjunction with theprocedure provided above, a reasonable order may be given to the edits,and edits that will have no effect (e.g., because the row they modifyhas been deleted) may be ignored. This priority-based handling of editsmay optionally be performed early in the process of handlingtransactions, such as in an adapter listener 808, resource manager 804,or in other processes in transaction handling subsystem 100.

The following examples demonstrate how edits coming into the system as aresult of transactions are propagated through the operations that formtable rules (i.e. the fuse, link, and loop operations). In accordancewith the present invention, applying an edit to a source table that istransformed by a join operation (i.e. a fuse, link, or loop) causes aset of zero or more edits to be added to logs associated with theindices that represent a combined table that results from such a joinoperation, and to a log associated with the combined table.

Referring to FIGS. 15A-15C, an example of a fuse operation with updatesdue to transactions in accordance with the present invention isdescribed. FIG. 15A shows the starting state of tables T1 (1500) and T2(1502), and of combined table TC (1504), which is the result of applyinga fuse operation to T1 (1500) and T2 (1502). It should be noted thattable TC (1504) is shown here for illustrative purposes only, and neednot actually be constructed, due to the use of T1 INDEX (1506) and T2INDEX(1508), which represent the combined table TC (1504). TC (1504) isa “virtual” table, in the sense that it does not actually exist inmemory.

All of table T1 (1500), table T2 (1502), index T1 INDEX (1506), andindex T2 INDEX (1508) as shown in FIG. 15A typically would have beengenerated during batch processing or other initialization stages, andwould already exist at the time that transaction processing begins. Useof minimal recalculation engine 810 permits application of relationshipsspecified in a model to perform analysis of historical data, such as isshown in FIG. 15A, in combination with incoming transaction data, aswill be described below.

As can be seen in FIG. 15A, the fuse operation takes two tables (in thiscase, T1 (1500) and T2 (1502), which are mapped one-to-one on a keyfield in each table (in this case, key field K1 (1510) in T1 (1500) andkey field K2 (1512) in T2 (1502)), and creates a single table whosefields are the union of the fields of the two tables (in this case,combined table TC (1504)). As mentioned, the combined table TC can bevirtual, and exist only in the form of the indices T1 INDEX 1506 and T2INDEX 1508. These indices that represent the combined tables indicatewhich rows from the source tables map to the rows of the combined table.For example, T1 INDEX (1506) contains {0 1 2 - - - }. This indicatesthat row 0 (counting from 0)of the combined table TC (1504) containsvalues from the fields of row 0 of T1 (1500), row 1 of the combinedtable contains values from the fields of row 1 of T1, row 2 of thecombined table contains values from the fields of row 2 of the combinedtable, row 3 of the combined table does not correspond to a row in T1,and row 4 of the combined table does not correspond to a row in thecombined table. Similarly, T2 INDEX (1508), which contains {0 - - - 2 13}, indicates that row 0 of the combined table TC (1504) contains valuesfrom the fields of row 0 of T2, row 1 of the combined table does notcorrespond to a row of T2 (since T2 does not contain the key “B”), row 2of the combined table contains values from the fields of row 2 of T2,row 3 of the combined table contains values from the fields of row 1 ofT2, and row 4 of the combined table contains values from the fields ofrow 3 of T2.

Referring now to FIG. 15B, the results of performing a transaction thataffects one of the source tables of the fuse operation are described. InFIG. 15B, an example transaction 1520 calls for an insert, which is anew row to be inserted into table T1 (1500). This insert is representedby an entry 1520 in T1-LOG 1522. Entry 1520 is an edit having anoperation-type “i”, for insert, index 3, indicating that the item is tobe inserted as row 3 of table T1 (i.e. after the current last row,counting from 0), a key value K1 of “D”, and other values associatedwith table T1, as listed in entry 1520.

Insertion of a new record (or row) into T1 will affect the combinedtable formed by the fused combination of T1 and T2. Therefore, toprocess the edit in T1-LOG 1522, minimal recalculation engine 810 willgenerate T1 INDEX-LOG 1524, to contain edits to T1 INDEX 1506, T2INDEX-LOG 1526, to contain edits to T2 INDEX 1508, and COMBINED LOG1528, to contain edits to the combined table TC 1504.

Since the key K1 in entry 1520 is “D”, which is not already present intable T2 1502, insertion of entry 1520 into T1 1500 will cause theaddition of a new row in the combined table, which means that insertionof entry 1520 requires insertion of a new row into both T1 INDEX 1506and T2 INDEX 1508. Minimal recalculation engine 810 determines therequired inserts to the indices, and adds them to T1 INDEX-LOG 1524, andT2 INDEX LOG 1526, respectively. As can be seen, in this example, aninsert edit at row 5 with a value 3 is added to T1 INDEX-LOG 1524,indicating that T1 INDEX 1506 should have a 3 inserted in (new)row 5(i.e. row 5 of the combined table takes values from row 3 of T1). Aninsert edit at row 5 of null (“−”) is added to T2 INDEX-LOG 1526 (i.e.row 5 of the combined table does not correspond to a row of T2, since T2does not contain the key “D”). Minimal recalculation engine 810 alsogenerates an entry in COMBINED LOG 1528, indicating that a new row 5should be inserted, with values as shown.

Although combined table TC 1504 does not actually exist in memory, beingrepresented by T1 INDEX 1506 and T2 INDEX 1508, since relatively fewoperations are required to generate the entry in COMBINED LOG 1528, asan optimization, COMBINED LOG 1528 is actually generated, even though itcould be generated from Ti LOG 1522, T1 INDEX-LOG 1524, and T2-INDEX LOG1526. Additionally, it should be noted that in actual operation, theedits in the combined table log will typically contain only theoperation and index, until an entire table rule (which may includenumerous fuse, link, and loop operations) is applied, after which thedata fields will be added to the final combined log. The data fields areshown here for illustrative purposes.

In general, an insert of a row that has a completely new key (i.e. a keythat is new across all the tables that are being fused), such as isshown in the example of FIG. 15B, causes inserts to be added to each ofthe table index logs, and an insert to be added to the combined tablelog. An insert of a row having a key that is present in one of the othertables that is being fused causes an update to be added to the index logof the table being edited, and an update to be added to the combinedtable log.

Referring now to FIG. 15C, a different transaction that causes changesto the original data shown in FIG. 15A is described. In FIG. 15C, entry1540 of T1-LOG 1542 indicates that row 1 (counting from 0) of T1 1500should be updated to change the value of F2. This edit will not requireany change to T1 INDEX 1506 or T2 INDEX 1508, since the keys are notaltered. The only change that must propagate through the fuse operationis an update to the data in the combined table to reflect the change inthe data in table T1 1500. Therefore, minimal recalculation engine 810need only add an entry to COMBINED LOG 1544 updating the values in theappropriate row of the combined table.

In general, an update of a non-key in any table that is being fused willcause an update to be added to the combined table log. Updates to thekeys of tables that are being fused are not allowed, except when theyare caused by a delete, as described below.

Deletion of a row from one of the source tables of a fuse operation maybe accomplished using, for example, a “physical” delete, in which thedelete is handled by transforming the delete edit into two update edits.The first update moves the data from the last row of the table in whicha delete is occurring to the row that is being deleted. The secondupdate changes the last row of the table to contain all “null” values.Such null entries of the table may be truncated from the end of thetable.

Note that an update to a key occurs as a result of transforming thedelete into two edits. As described above, such an update would notnormally be permitted. An exception is made in the case of an updateresulting from a delete edit. Generally, an update to a key in a fuseoperation resulting from a delete will require updates to the indices ofthe tables being fused, and to the combined log.

FIGS. 16A-16B show a link operation, which is used to join tables when aone-to-many relation is present. In FIG. 16A, the starting state oftables T1 (1600) and T2 (1602), and of combined table TC (1604), whichis the result of applying a link operation to T1 (1600) and T2 (1602) isshown. As before, table TC (1604) is shown for illustrative purposesonly, and need not actually be constructed in full, due to the use of T1INDEX (1606) and T2 INDEX(1608), which represent the combined table TC(1604). Also as before, all of table T1 (1600), table T2 (1602), indexT1 INDEX (1606), and index T2 INDEX (1608) as shown in FIG. 16A wouldhave been generated during batch processing, and would already exist atthe time that transaction processing begins.

As can be seen in FIG. 16A, the link operation takes two tables (in thiscase, T1 (1600) and T2 (1602), where a key (field K2 1610 of table T21602 in this example) of one of the tables is mapped one-to-many to alink field (field F1 1612 in this example) of the other table. Theresult is a table in which the records of the “one” table (T2 1602 inthis example) have been replicated to match those of the “many” table(T1 1600 in this example). Thus, in the combined table TC 1604, each rowin which F1 field 1612 matches K2 field 1610, the values correspondingto the row containing that value of K2 are replicated into combinedtable TC 1604. As in the previous example, the indices T1 INDEX 1606 andT2 INDEX 1608, that represent the combined table indicate which rowsfrom the source tables map to the rows of the combined table.

Referring now to FIG. 16B, the results of performing an update edit andan insert edit that affect one of the tables being joined in the linkoperation are described. Entry 1620 in T1-LOG 1622 is an edit thatupdates row 1 (counting from 0) of table T1 1600. Because the value ofthe link field F1 1612 would be changed by this operation, the values inT2 INDEX 1608 may need to be updated. The needed update to T2 INDEX 1608as a result of this update edit are added as entry 1624 in T2 INDEX-LOG1628, indicating that row 1 of T2 INDEX 1608 should be updated to thevalue 1 (i.e. row 1 of the combined table will contain values from row 1of T2). It is also necessary to add entry 1630 to COMBINED LOG 1634,indicating the changes to row 1 of combined table TC 1604 as a result ofthe update.

In general, an update of a link field in the “many” table causes updatesto be added to the index logs of all the “one” tables to contain theappropriate matching row values. Additionally, it causes an update to beadded to the combined table log. An update to a non-key field of the“one” table causes updates to be added to the combined table log foreach position in the combined table that corresponds to a position inthe index of the one table that contains the row number of the row thatwas updated. Updates to key fields of the “one” table are typically notpermitted, unless they occur due to a delete operation. Such updates area special case, in which minimal recalculation may not be the fastestapproach.

Entry 1621 of T1-LOG 1622 is an insert edit that would add a new row totable T1 1600. This insert causes addition of an insert edit as entry1636 in T1 INDEX-LOG 1638, and an insert edit as entry 1626 of T2INDEX-LOG 1628. Additionally the insert transaction causes entry 1632,an insert edit, to be added to COMBINED LOG 1634.

In general, an insert of a row containing a link field value into the“many” table causes inserts to be added to the index logs of all of the“one” tables, an insert to be added to the index log of the “many”table, and an insert to be added to the combined log. An insert of a newrow into a “one” table may cause numerous updates to be added to theindex log of that table, and the same number of updates to be added tothe combined log, or nothing. This is because the new key fields maymatch the link fields which did not have any matching key fields before,and thus had nulls in the index vector of the “one” table wherever thatlink field appeared in the “many” table. On the other hand, the new keyfields may not have any matches in the “many” table, in which casenothing needs to be done.

FIGS. 17A-17C illustrate a loop operation, which is used to combinetables when a many-to-many relation is present. In FIG. 17A, thestarting state of tables T1 (1700) and T2 (1702), and of combined tableTC (1704), which is the result of applying a loop operation to T1 (1700)and T2 (1702) is shown. As before, table TC (1704) is shown forillustrative purposes only, and need not actually be constructed infull, due to the use of T1 INDEX (1706) and T2 INDEX(1708), whichrepresent the combined table TC (1704). Also as before, all of table T1(1700), table T2 (1702), index T1 INDEX (1706), and index T2 INDEX(1708) as shown in FIG. 16A would have been generated during batchprocessing, and would already exist at the time that transactionprocessing begins.

As can be seen in FIG. 17A, the loop operation takes two tables (in thiscase, T1 (1700) and T2 (1702), where a first link field (field F1 1710of table T1 1700 in this example) of one of the tables is mappedmany-to-many to a second link field (field F4 1712 in this example) ofthe other table. The result is a table having a row for each combinationof matching values of the first and second link fields in the twotables. Thus, in the combined table TC 1704, for each row in which avalue of F1 field 1710 matches at least one value of F4 field 1712, thecombined table TC 1704 will have a row combining the values of rows fromT1 1700 and T2 1702 for each element in F4 field 1712 that matches thevalue of F1 field 1710. As in the previous examples, the indices T1INDEX 1706 and T2 INDEX 1708, that represent the combined table indicatewhich rows from the source tables map to the rows of the combined table.

Referring now to FIG. 17B, the result of performing an insert edit thataffects one of the tables being joined in the loop operation isdescribed. Entry 1720 in T2-LOG 1722 is an edit that inserts a new rowat the end of table T2 1702. Because this adds an additional value tofield F4 1712 in T2 1702 that corresponds to two values in field F1 1710in T1 1700, two new rows will be added to the combined table. This isreflected in the addition of two insert edits in T1 INDEX-LOG 1724, twoinsert edits in T2 INDEX-LOG 1726, and two insert edits in COMBINED LOG1728.

For edits on many-to-many table operations, an insert of a rowcontaining new or existing link field values into one of the tables,such that the new values are not present in at least one of the othertables involved in the many-to-many relation causes no effect. An insertof a row containing a new or existing link field value into one of thetables, such that the link field value is present in other tablesinvolved in the many-to-many relation causes a number of inserts to beadded to the index logs of each of the tables, and to the combined tablelog, equal to the product of the number of occurrences of the values inthe other tables.

In FIG. 17C, the result of performing an update edit that affects one ofthe source tables of the loop operation is described. Entry 1730 inT1-LOG 1732 is an edit that updates a non-link field of table T1 1700.Because this only changes the values of fields in the combined table,but not the positions in table T1 1700 from which those values aretaken, the update has no effect on T1 INDEX 1706 or T2 INDEX 1708.Accordingly, there are no additions to T1 INDEX-LOG 1734 or T2 INDEX-LOG1736. The update will require an update to the combined table everyplacewhere the entry that is being updated occurs in the combined table.Since the value being updated occurs in combined table TC 1704 atindices 0 and 2, updates are added to the combined table log withindices 0 and 2, with the updated value.

In general, an update value to a non-link field in a loop operationcauses as many updates to be added to the combined table log as thereare occurrences of the row that was updated in the combined table. Ifthe updated row is not present in the combined table, no updates will beadded to the combined table log. Note that an update to a link field ina loop operation is a special case that could lead to many edits beingadded to the logs of the table indices and the combined table. Ingeneral, such updates should only be permitted if they result from adelete edit.

Referring now to FIGS. 18A-18B, an example of an aggregation operationis shown. Such aggregation operations are used to represent many-to-onerelations, and occur only after the table rule has been applied, toconvert a combined table (which results from application of variousfuse, link, and loop operations) into the target table. Unlike the fuse,link, and loop operations which were described above, aggregationoperations are used to compute values for fields in a target table. Whentransactions affect the table on which the aggregation operation occurs,these changes must be propagated through the aggregation operation, toaffect the values of the target fields.

In FIG. 18A, the starting state of tables T1 1800 and the target tableTARGET 1802 are shown. It should be noted that in this example, T1 is acombined table that may have resulted from operation of the join rulesdiscussed above, such as fuse, link, and loop. In this example, thefield F1 1806 in table T1 1800 is mapped many-to-one to the field K 1808in the table TARGET 1802. To perform this mapping, all of the values infield K1 1804 that have the same values in field F1 1806 must becombined to provide a value for field F1 1810 in table TARGET 1802. Inthis example, a SUM operation is used to combine the values from fieldK1 1804. Thus, in the row of table TARGET 1802 having a value in field K1808 of “A”, the value of field F 1810 will be the sum of all of theelements in field K1 1804 in table T1 1800 that have “A” as the value offield F1 1806.

In addition to the starting states of tables T1 1800 and TARGET 1802,FIG. 18A shows the starting states of numerous indices that are usedperform the aggregation operation, and to handle transactions thataffect table T1 1800. Reverse index 1812 contains entries for each rowin table T1 1800. These entries have values that indicates which rows oftable TARGET 1802 contain the result of aggregating the correspondingrows of table T1 1800. In this example, reverse index 1812 contains {0 01 2 1}, indicating that rows 0 and 1 of table T1 1800 will be aggregatedinto row 0 of table TARGET 1802, rows 2 and 4 of table T1 1800 will beaggregated into row 1 of table TARGET 1802, and row 3 of table T1 1800will be aggregated into row 2 of table TARGET 1802.

The process of grouping rows of a combined table (such as table T1 1800)that have the same value in the “partitioning field” (a field that ismapped to a key field in the target table—field F1 1806 in this example)is known as “partitioning” the table. For each distinct value in thepartitioning field, a “partitioning group” exists, that associates allrows of the many table that have the same value in their partitioningfield. Generally, there will be one such partitioning group for each rowin the target table of an aggregation operation. Thus, an alternativeway to describe reverse index 1812 is to say that it contains thepartitioning group number for each row of the many table. Thus, in thisexample, reverse index 1812 could be said to specify that rows 0 and 1of table T1 1800 are in partitioning group 0, rows 2 and 4 are inpartitioning group 1, and row 3 is in partitioning group 2.

Lengths index 1814 contains an entry for each partitioning group,specifying how many rows of the many table are in each partitioninggroup. In the example, lengths index 1814 contains {2 2 1}, indicatingthat partitioning group 0 contains two rows from table T1 1800,partitioning group 1 contains 2 rows, and partitioning group 2 contains1 row.

Beginnings index 1816 contains an entry for each partitioning group,specifying the first row in the many table at which a member of thatpartitioning group is found. Thus, in the example shown in FIG. 5A,beginnings index 1816 contains {0 2 3}, indicating that row 0 is thefirst row in table T1 1800 containing a member of partition group 0, row2 is the first row containing a member of partitioning group 1, and row3 is the first row containing a member of partitioning group 2.

Ends index 1818 is similar to beginnings index 1816, containing an entryfor each partitioning group that specifies the last row in the manytable at which a member of that partitioning group is found. Forexample, since the last row of T1 1800 that has a value of “B” in fieldF1 1806 (i.e. partitioning group 1) is row 4, the entry in ends index1818 at index 1 (counting from 0) is 4.

As in previous examples, the starting values in table T1 1800, tableTARGET 1802, reverse index 1812, lengths index 1814, beginnings index1816, and ends index 1818 will have been computed during batchprocessing, and will already be present when transaction processingbegins.

Referring now to FIG. 18B, the results of performing a transaction thataffects table T1 1800 are described. Entry 1830 in T1-LOG 1832 is anedit resulting from a transaction that will cause a new row to be addedto table T1 1800, with K as the value of field F1 1806, and 6 as thevalue of field K1 1804. Because there is already a row in table T1 1800that has “K” as the value of field F1 1806 (row 3), there is already apartitioning group for rows with “K” in field f1 1806 (partitioninggroup 2). Thus, an insert at index 5 (the index of the new row), with avalue of 2 (the partitioning group of the new row) is added to reverseindex log 1834.

Because the insert adds a member to partitioning group 2, the length ofpartitioning group 2 will increase. Thus, an update edit is added tolengths log 1836, with an index of 2 (the partitioning group for the newrow), and a value of 2 (the new length of partitioning group 2). The newrow also will cause an update to be added to ends log 1838, having anindex of 2 (the partitioning group of the new row), and a value of 5(the index of the new row). No changes are needed to beginnings index1816, so the insert will not cause any entries to be added to beginningslog 1840.

Finally, the insert will cause a change to the target. To determine thenew value of field F 1810 in table TARGET 1802, the sum of all thevalues of field K1 1804 in table T1 1800 that are in rows belonging topartition group 2 must be computed. This can be achieved by gatheringall the values of field K1 1804 for rows in partitioning group 2, andadding them together. Alternatively, this can be done incrementally, byadding the value of field K1 1804 from entry 1830 to the existing valueof field F 1810 in row 2 of table TARGET 1802. The result of performingthe SUM operation is 10. An update edit is added to TARGET-LOG 1842 tochange the value of field F 1810 in row 2 to 10.

The examples shown above demonstrate that through use of indices andlogs, edits to source tables that result from transactions may bepropagated through table rules that use fuse, link, and loop operations,as well as through an aggregation operation that may occur when thecombined table that results from application of a table rule istransformed into a target table. Once a table rule has been applied, asdemonstrated above, field rules that compute the values of computedcolumns and fields may be applied.

The minimal recalculation engine performs the operations demonstratedabove on batches of edits provided by transaction manager 802 throughuse of a set of algorithms that are described below. In one embodimentof the invention, the edits (inserts, updates, and deletes) may beregrouped according to operation as they enter the minimal recalculationengine process. That is, the edits may be separated into a block ofinserts, a block of updates, and a block of deletes. Following thisregrouping, the minimal recalculation engine may process each of theseblocks of edits in a vectorized manner, first processing the block ofinserts, then the block of updates, and then the block of deletes.

The following procedure may be used to facilitate this vectorization.First, each edit has an index number assigned to it. This index is thenumber of the row (counting from 0) of the table that is affected by theedit, and will be entered in the log of the table, along with the typeof the edit. If the edit is an insert, the index value is the currentsize of the table, including any inserts in the current log that havenot yet been checkpointed. If the edit is an update or delete, thecurrent table is searched (using the range indices to optimize thesearch) for the key of the edit. The table log is also searched. If thekey is found in the table itself, the index field of the edit is theposition where the key is found in the table. If the key is found in thelog, the index field is the value of the index field of the log entry inwhich the key is found. Note that this procedure may optionally be usedin step 1914 of the “process” function of the minimal recalculationengine, as described below.

In an embodiment that uses this vectorization, the deletes are notactually done until the checkpointing, and all the inserts are at thebottom of the table, so the row numbering of the rows in the tables doesnot change until checkpointing occurs. This permits the edits to beseparated into blocks of inserts, updates, and deletes, and to beprocessed in blocks without regard to the relative chronology betweenthe inserts and updates, updates and deletes, and inserts and deletes.The chronology of the edits within each block is preserved.

Referring to FIG. 19, a flowchart for the “process” function isdescribed. The “process” function is executed by the minimalrecalculation engine to start processing a batch of edits (which areprovided as a parameter to the “process” function).

At step 1902, the system extracts from the edits a vector thatidentifies all of the tables that are being edited. These are the“source” tables for processing the batch of edits.

Next, at step 1904, the system identifies all the tables that will beaffected by the edits, and the other tables that affect those tables. Toidentify these tables, the system consults the mappings that are storedin the metadata repository 204 for the model that is being applied. Theset of tables produced at this step includes: (a) all the tables thatdepend (directly or indirectly) on the tables being edited; and (b) allthe tables on which the tables identified in (a) directly depend. Atstep 1906, the system sets the vector N to the tables identified in step1904.

In step 1908, the system initializes the vector SIZES, the length ofwhich is the count of N, to contain all 0s. The SIZES vector willcontain the sizes of all the intermediate tables between the sources andthe targets.

In step 1910, the system identifies the calculated fields in each of thesource tables (i.e. the tables being edited). These calculated fieldsdepend only on other fields in the same table. For example, if a sourcetable T1 has columns F1 and F2, and a column F3 where each entry in F3is computed as the sum of a corresponding entry in columns F1 and F2,then F3 would be a calculated field in T1. In step 1912, the necessarycalculations are performed to produce the calculated fields of thesource tables.

At step 1914, the batch of edits is converted into logs for the sourcetables (i.e. the tables being edited). These logs will be used to jointables using fuse, link, and loop operations, as shown the examplesdescribed hereinabove.

At step 1916, the vector S, which contains status information on each ofthe tables, is initialized. The length of S is the count of N, and S isassigned a value of “done” (which may be represented by a 2) for each ofthe source tables, and a value of “not started” (represented by a 0) foreach of the other tables. Note that S can also contain the value “inprocess” (represented by a 1) for tables that are currently beingprocessed.

Next, while not all entries in S are “done” (step 1918), the systemidentifies all the tables in N that depend on tables that have a statusin S of “done”, and stores them in the vector R (step 1920). In step1922, the “recalc” function is called for each table in R. The “recalc”function, which will be described in detail below, applies table andfield rules to recalculate a table using minimal recalculation methods.In step 1924, the system notifies internal checkpoint manager 812 of thereadiness for checkpointing of each table in R.

When all entries in S are “done”, then the “process” function hasfinished processing the batch of edits that it was sent, and exits. Theminimal recalculation engine that was executing the “process” functioncan now take a new batch of edits.

Referring now to FIGS. 20A-20B, the “recalc” function is described.Generally, the recalc function takes a table, denoted T, and appliestable rules and field rules to recalculate T. In step 2002, the systemsets the entry in S (the status vector) that corresponds to the table Tto “in process”.

In step 2004, the system retrieves the table rule for T from themetadata repository 204, and stores it in TR. As discussed above, thetable rule for T generally consists of a set of fuse, link, and loopoperations that may be applied to other tables to build T as a combinedtable. In addition to applying fuse, link, and loop operations, it maybe necessary (if, for example, there is a key transformation asdescribed above or an aggregation) to perform a partitioning.

Next, in step 2006, the system initializes COMB(T), the combined tablelog for T to an empty table with columns for operation and index. Notethat during application of the table rule, only the operation and indexare kept for the combined table log. The data fields of the combinedtable are added into the combined table log only after edits have beenpropagated through the table rule. This saves time by preventingneedless adding of data fields to the combined log for entries that willbe affected or removed by the application of further operations in thetable rules.

In step 2008, the table rule TR is evaluated. This may requireapplication of the “fuse”, “link” and “loop” functions describedhereinbelow to apply the various fuse, link and loop operations tosource tables that are required to build the table T.

In step 2010, the data fields are added to the combined table logCOMB(T), based on the table indices, and the data in the source tables.This populates the combined table log with data from the tables that arebeing joined to form the combined table.

Next, in step 2012, the system runs the field rules for key fields ofthe target table. To perform this step, field rules for the key fieldsof the target are retrieved from metadata repository 204. Thecalculations specified in the field rules are then applied, to calculatevalues for the key fields. Because the values of key fields may beaffected by application of such field rules, the range indices thatapply to the key fields (e.g., to facilitate rapid searches of the keyfields) may need to be updated. These range indices are updated duringtransaction processing by adding entries to ranges logs, in a mannersimilar to the way in which other tables and indices are altered throughuse of logs during transaction processing. Thus, in step 2013, edits maybe added to the ranges logs to change the range indices, if necessary.

If there is no partitioning (step 2014), the field rules are applied,adding new columns to the combined table log, COMB(T) (step 2016).

In step 2018, if there is partitioning, which may be caused if there isan aggregation (i.e. many-to-one) operation to be applied, then thepartitioning procedure, described with reference to 20B, is applied. Aswill be described more fully below, the partitioning procedure is aportion of the recalc function that handles the special indices, such asthe reverse index, and the length, beginnings, and ends indices that maybe associated with a partition or aggregation operation.

Next, in step 2020, the system produces a log for T, which takes theoperations and indices from the combined table log COMB(T), but takesthe data from COMB(T) only for the fields that are defined in the tableT. Source and intermediate fields that are not defined in the table T,but which may be needed to compute the fields in T are not moved fromCOMB(T) to the log of T.

Finally, in step 2022, the entry in the status vector S that correspondsto table T is set to “done”.

Referring now to FIG. 20B, the partitioning procedure, which containsthe additional steps of the recalc function that should be applied ifthere is partitioning, is described. At step 2050, the system uses themetadata for T to identify the partitioning fields. The partitioningfields are the fields in the combined table that will become the keyfield(s) in the target table after the aggregation operation is applied.These are the fields by which partitioning will have to be done.

In step 2052, the system takes the partitioning field values from thecombined table log, COMB(T), and determines to which partitioning groupeach partitioning field value belongs. A partitioning group serves tocollect all rows from the combined table that have the same values fortheir partitioning field. If the operation in the combined table log isan insert, or an update that changes the value of the partition field,then determining the partition group is achieved by searching for thevalue of the partition field in an existing target table that wasgenerated during batch processing, and that has had prior updatesapplied. If the value is found, then the partitioning group number ofthe partitioning field is equal to the index (position) in the existingtarget table at which the value was found. If the operation in thecombined table log is an update in which the partitioning field valuewas not changed, then the partitioning group number can be found in anexisting reverse index, at the position corresponding to the index valuein the update edit.

At step 2054, if all entries in the combined table log, COMB(T), havebeen processed, then the system proceeds to step 2082. Otherwise, usethe following steps are used to process the next entry in COMB(T). Oneskilled in the art will recognize that looping through the individualentries is used only for illustration of the algorithm, and that it ispossible to process an entire vector of entries at once.

In step 2056, if the entry was an insert and the entry belongs to anexisting partitioning group (step 2058), then at step 2060, updates areadded to the lengths log with an index corresponding to the partitioninggroup number of the entry, and a value one greater than the old value,and an update is added to the ends log, with an index corresponding tothe partitioning group number, and a value equal to the index (position)of the entry in the combined table log.

If the entry does not belong to an existing partitioning group, then atstep 2062, the system assigns a new partitioning group number to thevalue of the partitioning field in the insert entry, and stores the newpartitioning group number in NG, for future use. In step 2064, thesystem then adds inserts to the beginnings log, the ends log, and thelengths log for the new partitioning group number. The indices of theseinserts are equal to the new partitioning group number, and the valuesof the inserts to the beginnings log and ends log are the index(position) value from the combined table log entry, and the value forthe insert to the lengths log is 1.

In step 2066, because the entry being processed is an insert, the systemadds an insert to the reverse index log, with an index (position) of theindex value from the entry, and a value of the partitioning group numberof the value. Additionally, at step 2068, updates or inserts may beadded to the log of ranges.

If the entry is an update (step 2070), and the update changes the valueof a partitioning field (step 2072), then the update may change thepartitioning group to which the entry belongs. In step 2074, if theupdate causes addition of a new partitioning group, or deletion of anexisting partitioning group (or both), then at step 2076, the systemstores the partitioning group numbers that were added or deleted in NG,for later use, and handles updating the length log, the beginnings log,and the ends log, as well as the reverse index log. If the update causesa new partitioning group to be added, then an insert is added to thelengths log, the beginnings log, and the ends log, as described abovewhen an insert caused a new partitioning group to be created. If theupdate removes an existing partitioning group, then deletions are addedto the lengths log, the beginnings log and the ends log to delete theentries that correspond to the partitioning group that has been removed.

At step 2078, an update will be added to the reverse index log, toreflect the new partitioning group of the entry. Additionally, updatesare added as necessary to the lengths log, the beginnings log, and endslog.

In step 2080, the ranges logs for each of the columns of each of thetables or indices that are affected are updated added, as necessary.

In step 2081, if the entry was not an insert or update, it is a delete.As described above, the system generally transforms deletes intoupdates. If “physical” deletion is being used, the delete is convertedinto two updates, as described. Alternatively, if “logical” deletion isbeing used, the delete is converted into an update having “null” entriesfor its values. It will be understood that use of step 2081 to handledeletes is for illustrative purposes, and that there are other mannersand places in the algorithm that handling of deletes can occur. Forexample, in one embodiment, transformation of all delete edits in ablock of edits may be handled prior to processing other edits.

All the partitioning groups have been identified, and the reverse indexlog has been built for all entries in COMB(T), so the field rules, whichaggregate the data in the entries of the partitioning groups, areapplied. In step 2082, the system applies field rules that require therecalculation of the partitioning groups in their entirety to computefield values of the target table. This is done by identifying theaffected partitioning groups based on the index column in COMB(T) andtheir corresponding entry in the reverse index (and reverse index log),and searching the reverse index and reverse index log for all the rownumbers of the combined table that are needed to assemble the data forthe partitioning group. These row numbers are then used with the tableindices of the source tables to obtain the source data for thepartitioning group. Once the source data are assembled, then anaggregation operation specified in the field rule is applied to thesource data, computing the field value in the target table.

Next, in step 2084, any incremental field rules are applied to computevalues of fields in the target table. Incremental field rules arecertain field rules that do not require the recalculation of the entirepartitioning group to calculate the aggregation operation. For example,if the aggregation rule is a SUM, the only data that are needed tocompute the sum of data in a partitioning group are the old values ofthe SUM, the old values of the updated fields, the new values of theupdated fields, and the values of the inserted fields. It is notnecessary to recalculate the entire SUM. Similarly, for aggregationoperations such as MAX and MIN, which compute the maximum and minimumvalues of the data in the partitioning group, respectively, only the oldresult, the old values of the updated fields, and the new values areneeded to perform an incremental calculation. In the case of a COUNToperation, only the old value of the COUNT, and the number of insertsand the number of deletes are needed. In short, these operations can bedone in an incremental manner, which is likely to be much more efficientthan recalculating the entire group. Depending on the operation, thesystem gathers the required data to compute the result incrementally,and computes the result of the aggregation incrementally.

In step 2086, the system computes the log for T, which will be based onthe contents of NG, which identifies new and deleted partitioninggroups, as well as updated partitioning groups (as a side effect ofidentifying new and deleted partitioning groups). This is the last stepof the partitioning procedure, after which the system will continueexecuting the recalc function.

Referring now for FIG. 21, a flow chart for the fuse function isdescribed. The fuse function is the function that propagates edits totables through a fuse operation, as described hereinabove, to generateedits to the combined table, and to the indices that can be used togenerate the combined table from the tables being joined by the fuseoperation. The input to the fuse function is a vector of “table blobs”,where each “table blob” is a pair consisting of an identifier of atable, and a log to be applied to that table.

At step 2102, the system places the table identifiers from the “tableblobs”, which may be numbers that identify tables, into the vector D.This is used to identify the tables involved in the fuse.

Next, at step 2104, the system updates an entry in the CSIZES vectorcorresponding to the combined table that will result from the fuseoperation a value equal to the current length of any of the indices ofthe tables in D. The CSIZES vector is used to store the sizes ofcombined tables. The indices of the tables in D for this fuse operationhave already been computed during batch processing, and, because thefuse operation represents a one-to-one mapping, the indices should allbe of the same length.

At step 2106, the system checks to see if all the entries in all thelogs of the tables in D have been processed. If so, the system continuesto step 2128. Otherwise, the next edit is taken from one of the logs ofone of the tables in D, and is processed. It will be understood by oneskilled in the art that the setup of this operation as a loop, handlingeach edit one-at-a-time is for illustrative purposes, and that thesystem may handle numerous edits in the logs at once, as a vector ofedits. Such vector operations, that handle entire groups of items suchas edits in a log at once, are applied throughout a preferred embodimentof the system of the present invention.

If the edit is an insert (step 2108), the system checks to see if thekey of the item being inserted is present in any of the other tablesbeing fused (step 2110). If so, at step 2112, the system determines therow number in the first table where the key is found, and in step 2114,determines where this row number is found in the index vector for thattable, and stores the position value in K.

At step 2116, an update is added to the index log of the table intowhich the insert is made. The update will be at index K, and has a valueof the current size of the table (i.e. the position of the next elementto be added to the table, since the table positions count from 0).Additionally, an update gets added to the combined table log with anindex equal to K.

At step 2118, if the key was not present in any of the tables in D, aninsert is added to index log of the table into which the original insertedit was made, with an index of the size of the combined table, and avalue of the size of the table. Additionally, an insert is added to thecombined table log with an index equal to the current size of thecombined table (taken from CSIZES). Finally, the value in CSIZEScorresponding to the combined table is incremented by one.

At step 2120, the value in SIZES for the table into which the insert wasmade is incremented by one.

If the edit is an update (step 2122), then, in step 2124, an updateentry is added to the combined table log with an index equal to theindex position of the updated row number in the table index vector ofthe updated table. Note that this procedure for handling updates assumesthat the update does not affect a key value. As explained above, exceptwhen they result from a delete entry, updates to keys are forbidden inthe fuse operation. If a legitimate update to a key field in a fuseoperation occurs, it is necessary to find any matches of the old keyvalue, and add updates or deletes to change them in the index logs andthe combined table log, and find matches to the new key value, and addupdates or inserts, according to whether such matches were found, to theindex logs and the combined table log.

In step 2126, if the edit was not an insert or update, it is a delete.As described above, the system generally transforms deletes intoupdates. If “physical” deletion is being used, the delete is convertedinto two updates, as described. Alternatively, if “logical” deletion isbeing used, the delete is converted into an update having “null” entriesfor its key values. It will be understood that use of step 2126 tohandle deletes is for illustrative purposes, and that there are othermanners and places in the algorithm that handling of deletes can occur.For example, in one embodiment, transformation of all delete edits in ablock of edits may be handled prior to processing other edits.

Finally, in step 2128, the fuse function is finished, and may return.

Referring now to FIGS. 22A-22B, the link function is described. The linkfunction is the function that propagates edits to tables through a linkoperation, as described hereinabove, to generate edits to the combinedtable, and to the indices that can be used to generate the combinedtable from the source tables. As with the fuse function, the input tothe link function is a vector of “table blobs”, where each “table blob”is a pair consisting of an identifier of a table, and a log to beapplied to that table.

At step 2202 in FIG. 22A, the system places the table identifiers, whichmay be numbers that identify tables, into the vector D. This is used toidentify the tables involved in the link. The last entry in D is the“many” table, while other entries in D are “one” tables.

At step 2204, if all edits to the many table have been processed, thesystem proceeds to step 2228. Otherwise, the following steps are used toprocess the next edit to the many table. As before, one skilled in theart will recognize that looping through the individual edits is usedonly for illustration of the algorithm, and that it is possible toprocess an entire vector of edits at once.

At step 2206 the system determines if the edit is an insert to the manytable. If so, at step 2208, the system searches for the link fieldvalues in the record being inserted in the keys of each of the “one”tables. If found (step 2210), the index position which the row in whichthe link field value was found in the one table is stored in thevariable K (step 2212). At step 2214, an insert is added to the indexlog of the one table in which the link field value was found. The inserthas an index (i.e. position) of the size of the combined table (i.e. theinsert occurs at the end of the combined table), and a value of K.

If the link field value was not found, then at step 2216, an insert isadded to the index log of the one table in which the link field valuewas not found with an index of the size of the combined table, and avalue of “null”.

If the edit is an update of the many table (step 2218), then the systemattempts to find the occurrence of the new link field values in the onetables (step 2220). If found (step 2222), at step 2224, the system addsan update to the index log of the one table with an index equal to thenumber of the row being updated in the many table. The value of theupdate is the number of the row of the one table that matches the newlink field value.

If the new link field value was not found in the one table, then at step2226, an update is added to the index log of the one table with an indexequal to the number of the row being updated in the many table, and avalue of “null”.

In step 2227, if the edit was not an insert or update, it is a delete.As described above, the system generally transforms deletes intoupdates. If “physical” deletion is being used, the delete is convertedinto two updates, as described. Alternatively, if “logical” deletion isbeing used, the delete is converted into an update having “null” entriesfor its key values. It will be understood that use of step 2227 tohandle deletes is for illustrative purposes, and that there are othermanners and places in the algorithm that handling of deletes can occur.

Continuing in FIG. 22B, at step 2228, after all the edits in the log ofthe many table have been processed, an insert is added to the index logof the many table for each insert that was processed in the many tablelog. The indices and values of the inserts are the size of the manytable, with one added for each such insert.

In step 2230, the current size of the many table in the SIZES vector isincreased by the number of inserts to the many table.

Next, in step 2232, the system adds an insert to the combined table logfor each insert that was processed in the many table log. The indicesand values of the inserts will be the size of the combined table, withone added for each such insert.

In step 2234, the current size of the combined table in the CSIZESvector is increased by the number of inserts to the many table.

Next, in step 2236, if all edits to the one table(s) have beenprocessed, the system proceeds to step 2252. Otherwise, use thefollowing steps are used to process the next edit to the one table(s).As before, one skilled in the art will recognize that looping throughthe individual edits is used only for illustration of the algorithm, andthat it is possible to process an entire vector of edits at once.

In step 2238, if the edit is an insert into a one table, then at step2240, the system searches for the key value among the link fields of themany table. If found (step 2242), at step 2244 the system adds as manyupdates to the index table log of the one table as the number of matchesfound, with indices equal to the positions in the many table at whichthe key value was found, and values of the size of the one table (i.e.the position in the one table at which the new entry was added). At step2246, the system adds as many updates to the combined table log as thenumber of matches found, with indices of the positions in the many tableat which the key value was found. Note that if the key value was notfound, there is no effect.

At step 2248, the table size of the one table in the SIZES vector isincreased by 1.

If the edit is an update to the one table (step 2250), then at step2252, update entries are added to the combined table log with indicesequal to the positions in the one table index at which the one tableindex equals the row number of the one table that was updated. Note thatthis assumes that the update is to a non-key field of the one table. Inone embodiment, an update to a key field to the link table in the onetable is permitted only when the update to the one table resulted from adelete. This is a special case, and will be discussed furtherhereinbelow.

At step 2254, if the edit was not an insert or update, it is a delete.As described above, the system generally transforms deletes intoupdates. If “physical” deletion is being used, the delete is convertedinto two updates, as described. Alternatively, if “logical” deletion isbeing used, the delete is converted into an update having “null” entriesfor its key values. It will be understood that use of step 2254 tohandle deletes is for illustrative purposes, and that there are othermanners and places in the algorithm that handling of deletes can occur.

Finally, at step 2256, the link function is finished, and returns.

Referring to FIG. 23, the loop function is described. The loop functionis the function that propagates edits to tables through a loopoperation, as described hereinabove, to generate edits to the combinedtable, and to the indices that can be used to generate the combinedtable from the source tables. As with the other function that jointables, the input to the loop function is a vector of “table blobs”,where each “table blob” is a pair consisting of an identifier of atable, and a log to be applied to that table. In the case of the loopfunction, there should not be more than two tables combined by the loopfunction. If there are more than two tables to be combined in amany-to-many fashion, then the first two should be combined by a loopoperation, and then the third should be combined with the result of theloop operation, and so on.

At step 2302, the system places the table identifiers, which may benumbers that identify tables, into the vector D. This is used toidentify the two tables involved in the loop.

At step 2304, if all edits to the tables in D have been processed, thesystem proceeds to step 2326. Otherwise, the system uses the followingsteps to process the next edit to the many table. As before, one skilledin the art will recognize that looping through the individual edits isused only for illustration of the algorithm, and that it is possible toprocess an entire vector of edits at once.

If the edit is an insert (step 2306), then at step 2308, the systemsearches for the link field value in the other table from the table forwhich this edit is entered. If the link field value is found in theother table (step 2310), the system identifies the indices of thematches in the other table in step 2312. Next, in step 2314, the systemcreates a Cartesian product of the indices, including the indices of theedits in the table being edited. In step 2316, the resulting indices areadded to the index logs of the respective tables as inserts. In step2318, the same number of inserts are added to the combined table log,with indices starting at the size of the combined table before theedits, and counting up from that value. In step 2320, the size of thecombined table stored in the CSIZES vector is increased by the number ofinserts into the combined table (which is the same as the number ofinserts into the indices).

If the edit was an update to a non-link field (step 2322), at step 2324,update entries are added to the combined table log with indices equal tothe positions in the table index of the updated table where the numberof the updated row is found. Of course, if there are no such positionsin the table index, nothing will happen.

Note that an update to a link field in a loop operation is a specialcase that is preferably not permitted unless it results from a delete.If such an update does occur, it is a special case, similar to an updateof a key field in the one table in a link operation, in which extensiverecalculation may be necessary, and use of the minimal recalculationalgorithms may not be appropriate.

In step 2325, if the edit was not an insert or update, it is a delete.As described above, the system generally transforms deletes intoupdates. If “physical” deletion is being used, the delete is convertedinto two updates, as described. Alternatively, if “logical” deletion isbeing used, the delete is converted into an update having “null” entriesfor its key values. It will be understood that use of step 2325 tohandle deletes is for illustrative purposes, and that there are othermanners and places in the algorithm that handling of deletes can occur.For example, in one embodiment, transformation of all delete edits in ablock of edits may be handled prior to processing other edits.

Finally, in step 2326, the loop function is finished, and returns.

In several special cases in the algorithms described above, such asupdating a key field of the “one” table in a link operation, or updatinga link field in a loop operation, it has been noted that extensivechanges may result, and use of the minimal recalculation algorithms maynot be appropriate. In such cases, it may actually be less costly incomputation time to simply recalculate the table than to handle thevarious edits and logs that are normally used by the minimalrecalculation algorithms. In such cases, minimal recalculation engine810 may perform a complete recalculation, or, in an alternativeembodiment, recalculation can be deferred in these cases, and handled inbatch mode.

Not every such change will necessarily lead to such extensive updatesthat complete recalculation is preferable over use of the minimalrecalculation algorithms. Depending on the value being updated, eventhese special cases may require only a few changes to the resultingcombined table. Heuristics may be used to determine, in each suchspecial case, whether minimal recalculation is preferable over completerecalculation. If so, then minimal recalculation techniques, ofinserting edits in the various logs, may be used to perform the edits.

Referring now to FIG. 24, an algorithm for parallelizing the calculationof individual fields in a target table is shown. The algorithm assumesthat a certain number of processes run in parallel. In a preferredembodiment, this number is determined dynamically, based on theresources of the machine, and a user-specified set of configurationparameters. This algorithm could be used, for example, in steps 2012 or2016 of the recalc function as described herein with reference to FIG.20A.

In step 2402, the system identifies the fields for which the rulesaffecting those fields are entirely based on the fields of sourcesoutside of the table to which the field rules are being applied.

Next, in step 2404, these fields are divided into as many groups asthere are available processes to compute them. The choice of whichfields to place in which groups is preferably based on a measure of thecomplexity of the computation. One simple measure is the count of thefields. Other such measures include measures based on data gatheredduring previous calculations, measures based on the number of operationsin a calculation, or other known complexity measures.

In step 2406, since the groups of fields have been formed, the systemcauses each process to calculate one group of fields.

In step 2408, as soon as a process is done with calculating its group,the system determines if there are any fields of the target for whichall the source fields have been calculated. If not (step 2410), theprocess exits (step 2414), and its resources may be used when a processcompletes calculating and there are further fields that may now becalculated.

In step 2412, if there were fields for which all of the source fieldshave been calculated, then the fields are divided into as many groups asthere are processes available, in a similar manner to that described instep 2404. The system then loops back to step 2406.

When there are no more fields that can be calculated, all processes willhave reached step 2414, and the calculation of fields is complete.

Checkpointing—Operation of the Internal Checkpoint Manager

Once the logs for a target table are generated by the minimalrecalculation engine, the logs may be applied to update the targettable, as well as the various indices that were used to generate it.This process, referred to as checkpointing, performs the table update ina matter such that it is fault—recoverable (i.e. can be restarted afterinterruption without loss of data). In a preferred embodiment, thischeckpointing is non-blocking, meaning that the system does not need towait while the checkpointing is done before proceeding to process thenext batch of transactions. Handling this checkpointing is the task ofinternal checkpoint manager 812.

Referring to FIG. 25, when minimal recalculation engine 810 has computedthe effects of a batch of transactions on target table T1 2502, internalcheckpoint manager 812 is invoked for the table. Since internalcheckpoint manager 812 performs its checkpointing asynchronously,minimal recalculation engine 810 may immediately proceed withcalculation of the next target table, even if that target table dependson the values of table T1 2502.

Table T1 2502 has fields f1 2504, f2 2506, and so on, through fn 2508.There are also backups of these fields, f1 bu 2510, f2 bu 2512, and soon, through fnbu 2514. These backups may be referred to as “shadowmemory”, and are used to permit the logs to be applied asynchronously.Internal checkpoint manager 812 applies the logs to the backups of thefields, so that there will not yet be any effect on the data that may bein use by minimal recalculation engine 810.

When internal checkpoint manager 812 is finished applying the logs,fields f1 2504, f2 2506, through fn 2508 will still contain the olddata, but backup fields f1 bu 2510, f2 bu 2512, through fnbu 2514contain new field data.

The next time that all minimal recalculation engine(s) 810 finish withprocessing a group of transaction batches, the backup fields and theregular fields are swapped, so that the regular fields (i.e. the olddata) become the backup fields, and the backup fields (i.e. the newdata) become the regular fields. The swap is done by renaming thefields, rather than physically swapping the data, so it is a very fastand inexpensive operation. During the very brief period of time duringwhich the swap is taking place, the minimal recalculation engine(s) 810is required to wait, so that when it starts processing the next group oftransaction batches, it will generate logs based on the new field valuesof T1 2502. When this renaming occurs, the logs are also sent to theexternal checkpoint manager 814. Note that the period during which thisswap operation is performed corresponds to the “checkpoint1” statedescribed with reference to FIG. 13.

After the swap takes place, the backup fields once again contain the oldvalues for the fields of table T1 2502, so final checkpointing must takeplace, during which the same logs are applied to the backup fields. Oncethis is done, the regular fields and the backup fields are the same, andcheckpointing is complete. The logs may be stored for archival purposes,but it is no longer necessary to keep them in memory. Note that thissecond phase of checkpointing, during which the logs are being appliedto the backup fields, corresponds to the “checkpoint2“state describedwith references to FIG. 13.

Operation of the External Checkpoint Manager

The logs are also sent to external checkpoint manager 814. Externalcheckpoint manager 814 publishes the changes to target tables for whichthere are subscribers, such as the OLAP server. The logs are not appliedby external checkpoint manager 814. Rather, they are forwarded to thevarious subscribers, who may apply the logs, analyze the logs, orperform any other operations to the logs, and to a snapshot of theoriginal data (i.e. before applying logs) of the target tables to whichthey subscribe.

When a log gets too large, external checkpoint manager 114 may block,and apply the log, and let the subscribers know that a new version ofthe table to which they subscribe is now available. Note that in orderto permit subscribers to have a nearly real-time view of the changes tothe tables to which they subscribe, such blocking operations andresending of the entire table should be done infrequently. Typically itis more expensive for subscribers to a target table to re-read theentire table than to receive the incremental changes that have been madeto that table in the form of a log.

In a preferred embodiment, external checkpoint manager 814 communicateswith subscriber applications using a published protocol, so that newsubscriber applications can be built to communicate with externalcheckpoint manager 814.

A preferred embodiment of external checkpoint manager 814 providesaccess to files in database 200 in which columns of the logs are stored.Since information is continually being added to the column files of thelogs, external checkpoint manager 814 may take advantage of the factthat such files are contiguous in memory, and that changes to the logsare always appended to the end of the logs to provide a subscriber witha view of the column files for the logs that is static. This is done bysending subscribers header information about the column files thatindicates that the subscriber should read (or map into memory) only aportion of the column files (which may be growing) that was present atthe time that external checkpoint manager 814 published to subscribersthat the logs had been updated.

Many alterations and modifications may be made by those having ordinaryskill in the art without departing from the spirit and scope of theinvention. Therefore, it must be expressly understood that theillustrated embodiments has been shown only for the purposes of exampleand should not be taken as limiting the invention. The invention shouldtherefore be read to include all equivalent elements for performingsubstantially the same function in substantially the same way to obtainsubstantially the same result, even though not identical in otherrespects to what is shown and described in the above illustrations.

1. A method for transaction processing, comprising the steps of:specifying metadata describing database elements and relationshipsbetween the database elements; storing the metadata; creating a firstindex in response to the stored metadata; receiving a transaction;generating an index log of changes to the first index in response to thereceived transaction; and modifying the first index in response to thegenerated index log.
 2. The method of claim 1, further comprisingspecifying database elements required for a query.
 3. The method ofclaim 1, wherein the specifying step further comprises specifying datadependencies between database elements.
 4. The method of claim 1 whereinthe specifying step is performed by a user interface subsystem.
 5. Themethod of claim 1 wherein the receiving step, the generating step, andthe modifying step are performed by a transaction subsystem.
 6. Themethod of claim 1 wherein the transaction is an insert.
 7. The method ofclaim 1 wherein the transaction is an update.
 8. The method of claim 1wherein the transaction is a remove.
 9. The method of claim 8, furthercomprising: determining, based on the metadata and the log, additionaldatabase changes required by the transaction.
 10. The method of claim 1wherein the step of modifying the first index is performed when thetransactions are complete. 11.-38. (canceled)